Finite Difference Flow Optimization

RL Post-Training for Flow-Based Image Generators

Before training: Cat floating inside the international space station

After training: Cat floating inside the international space station

Results from before and after our human preference RL post-training. No CFG is used in either image.

David McAllister Miika Aittala Tero Karras Janne Hellsten Angjoo Kanazawa Timo Aila Samuli Laine

In this project, we set out to find a simple, grounded RL post-training method for diffusion image generators. We made a few observations about the structure of diffusion flows that led to Finite Difference Flow Optimization (FDFO), a new online RL algorithm that reduces variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. In our experiments, FDFO converges faster, reaches higher rewards, and produces fewer artifacts than current solutions.

Diffusion models fit the distribution of their training data conservatively. This means they assign non-zero likelihood to images that are near the data manifold but not on it, resulting in a distribution that envelops the natural image manifold but produces many low-quality samples. This sort of smoothing isn’t entirely bad, since it allows the model to generate unseen generalizations of training images, but it also produces the washed-out, artifacted outputs you get from a base model.

Figure: Left: the true data distribution (a fractal tree). Right: the conservatively fit distribution learned by a diffusion model, which assigns non-zero density to regions near but not on the data manifold.

Base model sample without CFG: bird sitting in the rim of a tire

Base model sample without CFG: retired Wile E. Coyote having fun

Base model sample without CFG: fluffy chick in an antique coffee cup

Base model sample without CFG: group of giraffes standing around each other

Figure: Samples from Stable Diffusion 3.5 Medium without CFG or post-training.

This motivates the use of methods to sharpen the distribution. The dominant approach is classifier-free guidance (CFG), which concentrates mass toward higher-density modes at inference time. This gives you a knob to trade diversity for sample quality. But it’s a crude band-aid: you can adjust its strength but not the style or quality it imparts on the images, and it typically results in an over-saturated look. RL-based post-training is an appealing alternative that produces the same useful tradeoff while steering the model toward reward-specified behavior.
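For readers who haven't seen the CFG knob written out, here is a minimal sketch of one common formulation: the sampler extrapolates from the unconditional prediction toward the conditional one. The function name and the convention that a scale of 1 means "no guidance" are our assumptions for illustration; implementations differ in how they parameterize the scale.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one.

    Assumed convention: scale == 1.0 returns the conditional prediction
    unchanged, larger values push further along the same direction.
    """
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

# Toy two-component "velocities" standing in for full image-sized tensors.
no_guidance = cfg_velocity([1.0, 2.0], [0.0, 1.0], scale=1.0)
strong_guidance = cfg_velocity([1.0, 2.0], [0.0, 1.0], scale=3.0)
```

The only control here is a single scalar, which is the limitation described above: it sets how hard to push, not what to push toward.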

Why Apply RL to Image Generation?

In LLMs, RL post-training is a standard tool today. Pretraining produces a model with broad coverage but a loose, unfocused output distribution. Post-training techniques like RLHF and RLVR then tighten that distribution onto useful outputs, which has driven major wins in alignment and reasoning. In image generation, CFG analogously concentrates mass from a pretrained distribution into desirable regions. Because of CFG’s dominance, RL adoption in image generation has been slower than in language models. However, RL has recently been gaining popularity since it can be steered with arbitrary rewards from learned models or human preferences.

It isn’t obvious how to apply RL to diffusion and flow models. Standard RL machinery was built for domains where model likelihoods are easily accessible, but these are notoriously difficult to estimate in continuous flows. Most existing approaches work around this by forcing the flow into a Markov decision process (MDP) framework with ungrounded proxy likelihoods. A natural question is whether we can ascend the reward gradient, the stated goal of policy gradient methods, in a more flow-native way.

Previously, I (David) worked on applying policy gradients to flow matching for continuous control with FPO, and for this project I joined the NVIDIA Helsinki team, which has recently focused on analyzing and improving the pretraining recipe for diffusion image generators (EDM, EDM2). FDFO is a simple first step toward an improved recipe for RL post-training of image generators.

Setup: Image Generator Post-Training

To introduce our method, let’s establish the pieces of our post-training loop. A flow matching model generates images by starting from random noise and taking a sequence of denoising steps to arrive at a clean image. Suppose we have a reward model, say, a vision-language model (VLM) that we ask: “Does this image match the caption [CAPTION]? Answer Yes or No.” We can then use the likelihood of the ‘Yes’ token as a scalar reward. Our goal is to update the flow model so that it produces images that score higher under this reward (i.e., match the caption). By optimizing this reward, we’re pushing the model’s output distribution to concentrate on images that match the prompt, which is exactly the kind of distribution sharpening we’re hoping to get from post-training. This is natural to formulate as an RL problem.
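To make the reward concrete, here is a minimal sketch of turning a VLM's next-token logits into a scalar. The tiny vocabulary, the logit values, and the `YES_ID` index are made up for illustration; a real VLM has its own tokenizer and a far larger vocabulary, and one could also renormalize over just the Yes and No tokens.

```python
import math

def yes_token_reward(logits, yes_id):
    """Softmax probability of the 'Yes' token, used as a scalar reward."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return exps[yes_id] / sum(exps)

# Hypothetical next-token logits over a 4-token vocabulary.
logits = [0.1, 2.0, -1.0, 0.5]
YES_ID = 1  # assumed index of the 'Yes' token
reward = yes_token_reward(logits, YES_ID)
```

The result always lies in (0, 1), so rewards are comparable across prompts without further scaling.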

Figure: A post-training epoch. Prompt and noise are sampled, the current flow model generates an image, a reward function scores it, and FDFO updates the model.

Background: The Denoising MDP

Currently, the most popular approach to RL post-training of image generators is the denoising Markov decision process (MDP) formulation. Introduced in DDPO and adopted by Flow-GRPO and DanceGRPO, it casts the multi-step sampling process as an MDP, where each denoising step is a separate action. While sampling images (rollouts), stochastic noise is injected at each step. If the corresponding image scores well, the random perturbations along that trajectory are reinforced. This means pulling the flow toward the perturbations if the image received high reward and away if it received low reward.

Figure 1: Comparison of Flow-GRPO (MDP approach) and FDFO (our method)

Figure: Illustration of differences between denoising MDPs (DDPO, Flow-GRPO) and our method. Our method provides consistent reward-ascending updates across the sampling chain. See the paper for more details.

Standard policy gradient methods like PPO require likelihoods for each action. In a flow model, the velocity at each step is continuous, and computing its exact likelihood under the flow is infeasibly expensive. The denoising MDP recasts each step’s perturbation as a sample from a Gaussian, which provides a proxy likelihood for PPO. This is simple, but it doesn’t reflect how likelihoods actually evolve under the flow (e.g., how the flow expands and contracts volume locally).
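As a rough illustration of the proxy likelihood, the sketch below takes one noisy sampler step and scores the result under an isotropic Gaussian centered on the deterministic step. This is a simplification under our own assumptions (plain Euler step, fixed noise scale `std`, hypothetical function names); the actual samplers in DDPO and Flow-GRPO use their own noise schedules and drift terms.

```python
import math
import random

def noisy_step(x, velocity, dt, std, rng):
    """One stochastic sampler step: deterministic move plus Gaussian noise."""
    mean = [xi + vi * dt for xi, vi in zip(x, velocity)]
    x_next = [m + std * rng.gauss(0.0, 1.0) for m in mean]
    return x_next, mean

def gaussian_log_prob(x_next, mean, std):
    """Log-density of an isotropic Gaussian N(mean, std^2 I) at x_next."""
    d = len(x_next)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_next, mean))
    return (-0.5 * sq_dist / std ** 2
            - d * math.log(std)
            - 0.5 * d * math.log(2.0 * math.pi))

rng = random.Random(0)
x_next, mean = noisy_step([0.0, 0.0], [1.0, -1.0], dt=0.1, std=0.2, rng=rng)
# This per-step number is what a PPO-style objective would treat as the
# action log-likelihood. It ignores how the flow itself warps volume.
proxy_log_prob = gaussian_log_prob(x_next, mean, std=0.2)
```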

Credit assignment is also challenging under this formulation. When a trajectory leads to high reward, each perturbation along it gets reinforced, even though most were irrelevant to increasing reward. Much of the update is noise that pushes the flow in reward-neutral directions.

We decided to build an algorithm outside the denoising MDP framework and started by just looking for the most direct way to push a flow model’s outputs toward higher reward.

The Direct Approach

For now, let’s assume the reward is differentiable (we’ll relax this later). Then, we can backpropagate through the VLM to get a gradient in image space. This represents the direction to shift the output image to produce greater reward. Flow models generate images through multiple denoising steps, each with its own predicted velocity, so we need to translate our image-space gradient into update directions for every velocity prediction along the sampling chain. The obvious way to do this is to backpropagate through the chain itself.

It’s worth analyzing what the backpropagation is doing mechanically. Each step’s Jacobian transforms the gradient one step backward through the chain. Composing all of them brings the image-space gradient all the way up to the initial noise.

The plot below shows this for a 2D flow, where we can visualize how different gradient directions transform along the chain. We sample an initial noise $x_0$ and transport it to the data distribution (in this case, a fractal distribution) by integrating the flow step-by-step. A reward gradient (computed through a differentiable reward) is defined at the data sample and then carried up the sampling chain by multiplying by each step’s Jacobian.

Existing methods have explored and refined variants of this for finetuning flow models, but problems remain. The gradient has to travel back through each denoising step via the chain rule. Each step’s Jacobian stretches, squeezes and rotates the gradient in poorly controlled ways. By the time it reaches early steps, high frequencies are suppressed exponentially and some axes have exploded while others have vanished. Many of these effects cannot be seen in a 2D toy figure but make the optimization extremely difficult on real image generators. Also, this assumes access to a differentiable reward model, which covers only a special case of the general RL setting. In practice, this rules out rewards extracted from the real world, including human feedback.
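The exploding and vanishing axes are easy to reproduce in a toy setting. The sketch below assumes a hypothetical linear 2D velocity field v(x) = A x integrated with Euler steps, so each step's Jacobian is I + dt·A, and carries a gradient back through ten steps with vector-Jacobian products. The matrix `A` and the step count are arbitrary choices for illustration, not values from our experiments.

```python
def vjp(jacobian, grad):
    """Vector-Jacobian product J^T g for a 2x2 Jacobian."""
    return [jacobian[0][0] * grad[0] + jacobian[1][0] * grad[1],
            jacobian[0][1] * grad[0] + jacobian[1][1] * grad[1]]

# Assumed velocity field v(x) = A x: expands along x, contracts along y.
A = [[2.0, 0.0], [0.0, -2.0]]
NUM_STEPS, DT = 10, 0.1

# Jacobian of one Euler step x_{k+1} = x_k + DT * A x_k, i.e. I + DT * A.
step_jacobian = [[1.0 + DT * A[0][0], DT * A[0][1]],
                 [DT * A[1][0], 1.0 + DT * A[1][1]]]

grad = [1.0, 1.0]  # reward gradient at the final sample
for _ in range(NUM_STEPS):
    grad = vjp(step_jacobian, grad)  # move one step back up the chain

# grad is now the gradient with respect to the initial noise.
axis_ratio = grad[0] / grad[1]
```

Even in this mild example the two components end up more than fifty times apart after ten steps, although they started equal.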

The chain of Jacobians is useful despite its flaws, since it gives us update directions for the velocity at every step, which we can train into the model. We propose a simple approximation that produces these per-step updates while avoiding the gradient problems, based on the following observations about the geometry of the optimization and underlying flow:

An update doesn’t have to exactly match the gradient to be useful. More precisely, a valid ascent direction is any direction at an angle of less than 90° from the true gradient, i.e., it lies in the same half-space. These ascent directions will still locally improve the reward, albeit less than the steepest direction. In fact, it’s standard practice to update along directions that deviate from the true gradient even when it’s available. Adam rescales the gradient per-coordinate, while Muon rebalances its principal components. Both lead to better optimization properties than naive SGD.
Flow matching models are tuned to produce flows with minimal curvature, since those require fewer steps at inference time. This straightness property is useful for us. Intuitively, if the flow is straight, then a gradient computed on the final sample will be a valid ascent direction for intermediate steps along the trajectory. Mathematically, this corresponds to the Jacobian of the flow being positive semi-definite: it can stretch and scale, but it doesn’t rotate a gradient more than 90°. Optimal transport flows, the theoretical ideal, have exactly positive semi-definite Jacobians. While learned flows aren’t equivalent to optimal transport, it is still a reasonable approximate mental model.
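Both observations can be checked numerically in 2D. The sketch below measures the cosine between a gradient and its image under two hypothetical Jacobians of our choosing: a symmetric positive definite matrix, which keeps the result in the same half-space, and a 120° rotation, which does not.

```python
import math

def matvec(m, v):
    """2x2 matrix times 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

grad = [1.0, 0.5]

# Symmetric with positive trace and determinant, hence positive definite.
psd_jacobian = [[3.0, 1.0], [1.0, 0.5]]

# A rotation by 120 degrees, which is not positive semi-definite.
theta = math.radians(120.0)
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

cos_psd = cosine(grad, matvec(psd_jacobian, grad))  # positive: still ascent
cos_rot = cosine(grad, matvec(rotation, grad))      # negative: now descent
```

A positive cosine means the transformed vector is still a valid ascent direction, which is all the argument above needs.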

A Simple Alternative

This motivates a simple alternative: instead of backpropagating through the sampling chain, just copy the image-space gradient up to all earlier steps. The argument from the previous section suggests this is a valid direction at each step, as long as the flow isn’t too rotational. Here we show the backpropagated gradient next to a simple copy:

Perhaps surprisingly, we find this works better than the exact backpropagated gradient. The copied gradient is biased, but it has dramatically better conditioning. It preserves the full frequency spectrum at every step, providing a clean image-aligned training signal, just like in pretraining. Our ablations confirm this. Using the true reward gradient applied uniformly to all steps (light brown in the figure below and row I in paper figure 8) performs comparably to our full method. However, backpropagating that gradient through the sampling trajectory (dark brown, row J figure 8) is significantly worse.

Extending to Non-Differentiable Rewards

Everything so far still requires a differentiable reward model to compute the image-space gradient. We want to make our method compatible with any scalar reward, including human feedback, external tools and black-box metrics, which is the general RL scenario.

To handle any scalar reward, we use a simple mechanism: generate a pair of images from the same starting noise but introduce stochasticity during sampling so that they diverge into two similar, counterfactual images.

Paired counterfactual images from shared noise

Figure: A pair of counterfactual images generated from the same initial noise with slight stochasticity. The two images share overall composition but differ in their details. We can estimate a reward gradient direction by comparing them head-to-head.

One image typically yields a higher reward than the other. The difference between the two images, weighted by their reward difference, gives us a direction that points from the worse image toward the better one. This is effectively a finite difference that tells us how reward varies along the direction the two images differ. We use this as our update direction and apply it uniformly across the flow, just as we did with the differentiable reward gradient. The paper analyzes this more rigorously, but the method itself is as simple as that.
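Here is a minimal sketch of that finite difference on flattened toy "images". To check that it behaves like a gradient estimate, we use a hypothetical linear reward r(x) = w·x, whose true gradient is w. The function names and numbers are ours for illustration, and any normalization the full method applies is omitted.

```python
def fd_direction(img_a, img_b, reward_a, reward_b):
    """Image difference scaled by reward difference.

    Points from the worse image toward the better one, and is unchanged
    if the two images swap roles.
    """
    dr = reward_a - reward_b
    return [dr * (a - b) for a, b in zip(img_a, img_b)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

w = [1.0, -2.0]  # gradient of the assumed linear reward
img_a, img_b = [0.3, 0.1], [0.1, 0.4]
direction = fd_direction(img_a, img_b, dot(w, img_a), dot(w, img_b))

# For a linear reward, direction . w equals (w . (img_a - img_b))^2,
# so the estimate never points against the true gradient.
alignment = dot(direction, w)
```

For general rewards the same holds only locally and on average, which is why the two images need to be similar.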

FDFO Algorithm

Sample two images from the same initial noise with slight stochasticity.
Evaluate the reward for each image.
Multiply the image difference by the reward difference.
Update the model's velocity prediction toward this direction at each noise level.
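The four steps above can be sketched end to end on a toy problem. Everything below is an assumption made to keep the example small and is not our training code: the "model" is just one learnable 2D velocity per sampling step, the reward is a black-box distance to a hypothetical `TARGET`, and "updating toward the direction" is a plain additive step on each velocity.

```python
import random

NUM_STEPS = 4            # assumed number of sampling steps
DT = 1.0 / NUM_STEPS
NOISE = 0.5              # assumed scale of the injected stochasticity
TARGET = [2.0, -1.0]     # the toy reward peaks here

def reward(x):
    """Black-box scalar reward: negative squared distance to TARGET."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))

def sample(theta, x0, rng):
    """Integrate the toy flow from x0, injecting fresh noise at every step."""
    x = list(x0)
    for k in range(NUM_STEPS):
        x = [xi + DT * vi + NOISE * DT ** 0.5 * rng.gauss(0.0, 1.0)
             for xi, vi in zip(x, theta[k])]
    return x

def fdfo_train(iterations=300, batch=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(NUM_STEPS)]  # per-step velocities
    for _ in range(iterations):
        update = [0.0, 0.0]
        for _ in range(batch):
            x0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
            # 1. Two samples from the same initial noise.
            xa, xb = sample(theta, x0, rng), sample(theta, x0, rng)
            # 2. Evaluate the reward for each.
            dr = reward(xa) - reward(xb)
            # 3. Multiply the sample difference by the reward difference.
            for i in range(2):
                update[i] += dr * (xa[i] - xb[i]) / batch
        # 4. Nudge the velocity at every step along the same direction.
        for k in range(NUM_STEPS):
            for i in range(2):
                theta[k][i] += lr * update[i]
    return theta

theta = fdfo_train()
# Mean of the generated samples (the initial noise has zero mean).
mean_output = [sum(DT * theta[k][i] for k in range(NUM_STEPS)) for i in range(2)]
```

Note that `reward` is only ever called, never differentiated, yet the mean of the generated samples drifts toward `TARGET`.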

Highlighted Results

We tested FDFO by post-training Stable Diffusion 3.5 on three rewards: PickScore, VLM alignment, and their combination. PickScore is the easiest to optimize, and both FDFO and Flow-GRPO do well on it. VLM alignment is a noticeably harder reward, however, and FDFO pulls ahead significantly. On VLM alignment and the combined reward, FDFO reaches reward levels that Flow-GRPO is unable to attain.

Figure: Reward convergence curves (adapted from Fig. 2 in the paper). FDFO (blue) converges faster and to higher reward than Flow-GRPO (red) across all three reward types.

The quality of training also differs. Flow-GRPO periodically produces grid-like artifacts during extended training which fade in and out but can be severe. We don’t observe these with our method, even after equally long training. Flow-GRPO also shows consistent style drift across prompts, which we attribute to its noisier updates causing random mutation in reward-irrelevant dimensions.

Before post-training, the base model’s output quality without CFG is poor. Enabling CFG dramatically improves alignment and quality, which is why it’s standard practice. After post-training with our method, the model produces high-quality, well-aligned images on its own, without CFG. Re-enabling CFG at that point mainly reduces diversity and introduces the characteristic high-contrast look, with debatable benefit. Interestingly, our method reaches similar reward levels regardless of CFG scale.

(a) Original model, no CFG

(b) Original model, CFG enabled

(c) 80 epochs of post-training, no CFG

(d) 80 epochs of post-training, CFG enabled

Prompt: "A cat wearing ski goggles is exploring in the snow." (a) Without CFG, the original model has high diversity but poor quality and alignment. (b) Enabling CFG flips those axes. (c) Our RL post-training achieves similar diversity and alignment with more detail. (d) Enabling CFG simplifies images and reduces diversity slightly.

Independent metrics support these findings. OneIG-Bench has its own prompt alignment metric, so it measures the same goal without being directly hackable by the RL. Here, the combined reward significantly outperforms PickScore alone, and FDFO outperforms Flow-GRPO under both. HPSv2 human preference scores tell a similar story, with the combined reward and FDFO coming out on top. Across all rewards, RL post-training consistently reduces output diversity, just as CFG does. Both are trading diversity for alignment and quality. The difference is that RL gives you full control over this tradeoff through reward design.

Grad Student Reward Function

Everything so far has used VLM rewards, which tend to prefer a very “AI” look when you ask them to score image quality. We wanted to try something more interesting: human preferences, directly on-policy.

Screenshot of the FDFO human feedback interface

Screenshot of the simple interface used to compare each pair of generated images for interactive RL.

FDFO is well-suited to this. It doesn't require differentiable rewards, and humans aren't differentiable. It's on-policy and sample-efficient, so you can watch the model respond to your feedback in near real-time. And because the method already compares pairs of images, it maps naturally onto a simple UX: show the rater two images at a time, and let them pick the better one.
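One simple way to plug a rater's click into the pairwise update is to treat the choice as a signed reward difference. The mapping below, including the zero for a skipped pair and the names used, is our assumption for illustration rather than a description of the exact encoding in our interface.

```python
# Assumed encoding of a rater's choice as (reward_a - reward_b).
CHOICE_TO_REWARD_DIFF = {"a": 1.0, "b": -1.0, "none": 0.0}

def update_direction(img_a, img_b, choice):
    """Image difference signed by the rater's preference.

    Points toward whichever image was picked, and is zero when the
    rater indicates no preference.
    """
    dr = CHOICE_TO_REWARD_DIFF[choice]
    return [dr * (a - b) for a, b in zip(img_a, img_b)]

# Toy two-pixel "images".
picked_a = update_direction([1.0, 0.0], [0.0, 0.0], "a")
picked_b = update_direction([1.0, 0.0], [0.0, 0.0], "b")
skipped = update_direction([1.0, 0.0], [0.0, 0.0], "none")
```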

We built a simple web GUI that generates a batch of image pairs, lets the user select their preferred image from each pair (or indicate no preference), and then trains on the batch for about 30 seconds before serving the next round. Starting from Stable Diffusion 3.5 Medium with a LoRA adapter and no CFG, we ran 50 epochs of this, which took about four hours for training and labeling on a single node. The images began as expected for a base model without CFG: washed out and with poor structure. Over the epochs, they evolved into sharp, realistic images that adhered to the prompt. I tend to prefer a photorealistic style, and the model learned to match that. The videos below show seed-matched generations across the 50 training epochs.

A vase of flowers

The little prince and the fox

Friendly owl on a pile of books

Drawing of a smiling cat

Baby bird

Dog at an open door

Each video fixes the initial sampling noise and shows how the resulting image changes over RL training time.

One practical finding from this experiment: updating the model uniformly at all noise levels is important. It allows the RL to correct both low-frequency structure (composition, layout) and high-frequency detail (textures, sharpness) simultaneously. We observed this with VLM rewards too, but it showed up much more prominently with this higher quality reward signal. Here are some enlarged images where you can see details added during post-training:

Before training: Disheveled owl on a pine tree

After training: Disheveled owl on a pine tree

Selected images generated from the same seed for the base model (left) and RL post-trained model (right).

What made this work well in practice was the on-policy RL loop: at every step, I was looking at what the current model actually produces and nudging it in the direction I preferred. The model’s outputs improved visibly between rounds, which made the process feel manageable even over hundreds of decisions. This kind of iterative shaping of a model’s output distribution is one of the things RL is uniquely good at. In total, this required labeling just 3,200 pairs of images and imparted a visual style on the model that matched my preferences. It’s exciting to think about how this could be extended to shape a model’s aesthetic at scale, or even personalized to individual users.

Takeaways

Our main takeaway from this project is that the structure of the generative process matters for how you do RL on it. We found that working with this structure rather than against it led to a simpler algorithm that performs better.

A specific instance of this: policy gradient methods rely on the gradient of the log-likelihood, not the likelihood itself. Likelihoods are notoriously hard to estimate in flow models, so we decided not to try. This opened up a more direct path to the reward gradient. We suspect there are more wins like this, where the properties of flows make certain things easier than the imported RL framework assumes.

There are still important open questions. Retaining diversity during RL optimization is difficult, and KL regularization isn’t a satisfying solution. Also, aspects of image quality not covered by the reward can quietly degenerate, so making rewards comprehensive enough that training doesn’t have to be stopped early is a challenge in its own right. We see FDFO as a clean starting point for working on these problems.