RL Post-Training for Flow-Based Image Generators
In this project, we set out to find a simple, grounded RL post-training method for diffusion image generators. We made a few observations about the structure of diffusion flows that led to Finite Difference Flow Optimization (FDFO), a new online RL algorithm that reduces variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. In our experiments, FDFO converges faster, reaches higher rewards, and produces fewer artifacts than current solutions.
Diffusion models fit the distribution of their training data conservatively. This means they assign non-zero likelihood to images that are near the data manifold but not on it, resulting in a distribution that envelops the natural image manifold but produces many low-quality samples.
This motivates methods that sharpen the distribution. The dominant approach is classifier-free guidance (CFG), which extrapolates the conditional prediction away from the unconditional one at sampling time, trading diversity for sample quality.
In LLMs, RL post-training is a standard tool today. Pretraining produces a model with broad coverage but a loose, unfocused output distribution. Post-training techniques like RLHF sharpen that distribution toward preferred outputs.
It isn’t obvious how to apply RL to diffusion and flow models. Standard RL machinery was built for domains where model likelihoods are easily accessible, but these are notoriously difficult to estimate in continuous flows. Most existing approaches work around this by forcing the flow into a Markov decision process (MDP) framework with ungrounded proxy likelihoods. A natural question is whether we can ascend the reward gradient, the stated goal of policy gradient methods, in a more flow-native way.
Previously, I (David) worked on applying policy gradients to flow matching for continuous control with FPO.
To introduce our method, let’s establish the pieces of our post-training loop. A flow matching model generates an image from a text prompt. To score it, we show the image to a VLM and ask: “Does this image match the caption [CAPTION]? Answer Yes or No.” We can then use the likelihood of the ‘Yes’ token as a scalar reward. Our goal is to update the flow model so that it produces images that score higher under this reward (i.e., match the caption). By optimizing this reward, we’re pushing the model’s output distribution to concentrate on images that match the prompt, which is exactly the kind of distribution sharpening we’re hoping to get from post-training. This is natural to formulate as an RL problem.
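As a concrete sketch, extracting this scalar reward from the VLM’s answer logits could look like the following. The dictionary interface and token names here are illustrative assumptions; a real VLM returns logits over its full vocabulary.

```python
import math

def yes_token_reward(answer_logits):
    """Scalar reward: the probability the VLM assigns to answering
    'Yes' to "Does this image match the caption?". `answer_logits`
    maps candidate answer tokens to raw logits (a hypothetical,
    simplified interface)."""
    z = max(answer_logits.values())  # subtract max for a stable softmax
    exps = {tok: math.exp(v - z) for tok, v in answer_logits.items()}
    return exps.get("Yes", 0.0) / sum(exps.values())

# A VLM that prefers "Yes" yields a reward above 0.5.
r = yes_token_reward({"Yes": 2.0, "No": 0.5})
```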
Currently, the most popular algorithm for RL post-training image generators is the denoising Markov decision process (MDP). Introduced in DDPO, this formulation treats each denoising step as an action: noise injected during sampling makes every step a Gaussian policy with a tractable likelihood, so standard policy-gradient machinery can be applied.
Standard policy gradient methods like PPO can then be applied directly to this MDP.
Credit assignment is also challenging under this formulation. When a trajectory leads to high reward, each perturbation along it gets reinforced, even though most were irrelevant to increasing reward. Much of the update is noise that pushes the flow in reward-neutral directions.
We decided to build an algorithm outside the denoising MDP framework and started by just looking for the most direct way to push a flow model’s outputs toward higher reward.
For now, let’s assume the reward is differentiable (we’ll relax this later). Then, we can backpropagate through the VLM to get a gradient in image space. This represents the direction to shift the output image to produce greater reward. Flow models generate images through multiple denoising steps, each with its own predicted velocity, so we need to translate our image-space gradient into update directions for every velocity prediction along the sampling chain. The obvious way to do this is to backpropagate through the chain itself.
It’s worth analyzing what the backpropagation is doing mechanically. Each step’s Jacobian transforms the gradient one step backward through the chain. Composing all of them brings the image-space gradient all the way up to the initial noise.
The plot below shows this for a 2D flow, where we can visualize how different gradient directions transform along the chain. We sample an initial noise $x_0$ and transport it to the data distribution (in this case, a fractal distribution) by integrating the flow step-by-step. A reward gradient (computed through a differentiable reward) is defined at the data sample and then carried up the sampling chain by multiplying by each step’s Jacobian.
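The backward transport in the plot can be mimicked in a few lines. This toy sketch (an assumed linear velocity field, step size, and step count, not the paper’s setup) Euler-integrates a 2D flow and carries a gradient back through the chain with per-step vector-Jacobian products:

```python
# Toy 2D flow: Euler-integrate a linear velocity field v(x) = A @ x,
# then carry a reward gradient backward through the chain by
# multiplying with each step's Jacobian transpose. Plain Python
# floats; A, h, and the step count are made-up values.

def matvec(M, x):
    return [M[0][0]*x[0] + M[0][1]*x[1], M[1][0]*x[0] + M[1][1]*x[1]]

def transpose(M):
    return [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]

A = [[-0.2, 0.1], [0.0, -0.3]]   # hypothetical linear field v(x) = A x
h, steps = 0.1, 10               # Euler step size and count

# Forward pass: integrate and record each step's Jacobian I + h*A.
x = [1.0, 1.0]
J = [[1.0 + h*A[0][0], h*A[0][1]], [h*A[1][0], 1.0 + h*A[1][1]]]
jacobians = []
for _ in range(steps):
    v = matvec(A, x)
    x = [x[0] + h*v[0], x[1] + h*v[1]]
    jacobians.append(J)          # constant here; state-dependent in general

# Backward pass: transport a gradient defined at the final sample
# back to the initial noise, one Jacobian at a time.
g = [1.0, 0.0]
for Jt in reversed(jacobians):
    g = matvec(transpose(Jt), g)
```

The key point is that the backward pass is just repeated vector-Jacobian products; with a learned flow the Jacobians are state-dependent and come from autograd rather than a fixed matrix.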
Existing methods that backpropagate rewards through the sampling chain rely on this long product of Jacobians, which is poorly conditioned: some directions are attenuated and others amplified, distorting the signal that reaches earlier steps.
The chain of Jacobians is useful despite its flaws, since it gives us update directions for the velocity at every step, which we can train into the model. We propose a simple approximation that produces these per-step updates while avoiding the gradient problems, based on the following observations about the geometry of the optimization and underlying flow:
An update doesn’t have to exactly match the gradient to be useful. More precisely, a valid ascent direction is any direction at an angle of less than 90° from the true gradient, i.e., one that lies in the same half-space. These ascent directions still locally improve the reward, albeit more slowly than the steepest direction. In fact, it’s standard practice to update along directions that deviate from the true gradient even when it’s available: Adam rescales the gradient per-coordinate, while Muon rebalances its principal components. Both lead to better optimization properties than naive SGD.
Flow matching models are tuned to produce flows with minimal curvature, since those require fewer steps at inference time. This straightness property is useful for us. Intuitively, if the flow is straight, then a gradient computed on the final sample will be a valid ascent direction for intermediate steps along the trajectory. Mathematically, this corresponds to the Jacobian of the flow being positive semi-definite: it can stretch and scale, but it doesn’t rotate a gradient more than 90°. Optimal transport flows, the theoretical ideal, have exactly positive semi-definite Jacobians. While learned flows aren’t equivalent to optimal transport, it is still a reasonable approximate mental model.
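One way to make the straightness argument precise (our sketch, writing $x_1$ for the final sample): let $g = \nabla_{x_1} R(x_1)$ be the image-space reward gradient and $J_t = \partial x_1 / \partial x_t$ the Jacobian mapping a perturbation at step $t$ to the final sample. A perturbation $d$ at step $t$ changes the reward at first order by $\langle J_t d, g \rangle$, so copying $g$ back to step $t$ (taking $d = g$) improves the reward whenever

$$\langle J_t\, g,\; g \rangle = g^\top J_t\, g > 0,$$

which holds for any positive definite $J_t$. This is exactly the condition that the Jacobian stretches and scales but doesn’t rotate the gradient more than 90°.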
This motivates a simple alternative: instead of backpropagating through the sampling chain, just copy the image-space gradient up to all earlier steps. The argument from the previous section suggests this is a valid direction at each step, as long as the flow isn’t too rotational. Here we show the backpropagated gradient next to a simple copy:
Perhaps surprisingly, we find this works better than the exact backpropagated gradient. The copied gradient is biased, but it has dramatically better conditioning. It preserves the full frequency spectrum at every step, providing a clean image-aligned training signal, just like in pretraining. Our ablations confirm this. Using the true reward gradient applied uniformly to all steps (light brown in the figure below and row I in paper figure 8) performs comparably to our full method. However, backpropagating that gradient through the sampling trajectory (dark brown, row J figure 8) is significantly worse.
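In code, the copied-gradient update is almost trivial. This sketch (our reading of the scheme; `step_size` and the list-based vectors are assumptions) builds per-step velocity targets by shifting every step’s predicted velocity by the same image-space gradient:

```python
# "Copy the gradient" sketch: the same image-space gradient g becomes
# the velocity-target offset at every sampling step, instead of a
# per-step backpropagated gradient. step_size is an assumed
# hyperparameter.

def velocity_targets(predicted_velocities, image_grad, step_size=0.1):
    """Return per-step target velocities: each predicted velocity is
    shifted by the same image-space reward gradient."""
    return [
        [v_i + step_size * g_i for v_i, g_i in zip(v, image_grad)]
        for v in predicted_velocities
    ]

vels = [[0.5, -0.2], [0.4, -0.1], [0.3, 0.0]]   # toy 3-step chain
g = [1.0, 0.5]                                   # reward gradient at x_1
targets = velocity_targets(vels, g)
```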
Everything so far still requires a differentiable reward model to compute the image-space gradient. We want to make our method compatible with any scalar reward, including human feedback, external tools and black-box metrics, which is the general RL scenario.
So far, we’ve assumed that the reward is differentiable. To handle any scalar reward, we use a simple mechanism: generate a pair of images from the same starting noise but introduce stochasticity during sampling so that they diverge into two similar, counterfactual images.
One image typically yields a higher reward than the other. The difference between the two images, weighted by their reward difference, gives us a direction that points from the worse image toward the better one. This is effectively a finite difference that tells us how reward varies along the direction the two images differ. We use this as our update direction and apply it uniformly across the flow, just as we did with the differentiable reward gradient. The paper analyzes this more rigorously, but the method itself is as simple as that.
FDFO Algorithm
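A minimal end-to-end sketch of the loop described above, on a toy one-dimensional “flow” (the toy model, reward, and hyperparameters are illustrative assumptions, not the paper’s implementation):

```python
import random

# Toy FDFO-style loop in one dimension: the "flow" is one scalar
# velocity per step, and the reward peaks at x = 2.

def reward(x):
    return -(x - 2.0) ** 2

def sample(velocities, noise, sigma, rng):
    """Integrate the toy flow from a given starting noise, injecting
    stochasticity at every step so paired trajectories diverge."""
    x = noise
    for v in velocities:
        x += 0.1 * v + sigma * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
velocities = [0.0] * 10          # 10-step toy flow
lr, sigma = 0.3, 0.05

for _ in range(400):
    noise = rng.gauss(0.0, 1.0)  # shared starting noise for the pair
    xa = sample(velocities, noise, sigma, rng)
    xb = sample(velocities, noise, sigma, rng)
    # Finite difference: direction from the worse sample toward the
    # better one, weighted by the reward gap.
    delta = (reward(xa) - reward(xb)) * (xa - xb)
    # Apply the same direction uniformly as a target at every step.
    velocities = [v + lr * delta for v in velocities]

# Deterministic rollout from zero noise should now land near x = 2.
final = sum(0.1 * v for v in velocities)
```

Even this stripped-down version shows the three ingredients: paired trajectories from shared noise, a reward-weighted image difference, and a uniform per-step velocity update.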
We tested FDFO by post-training Stable Diffusion 3.5, using Flow-GRPO, a denoising-MDP method, as our baseline. FDFO converges faster and reaches higher final rewards.
The quality of training also differs. Flow-GRPO periodically produces grid-like artifacts during extended training which fade in and out but can be severe. We don’t observe these with our method, even after equally long training. Flow-GRPO also shows consistent style drift across prompts, which we attribute to its noisier updates causing random mutation in reward-irrelevant dimensions.
Before post-training, the base model’s output quality without CFG is poor. Enabling CFG dramatically improves alignment and quality, which is why it’s standard practice. After post-training with our method, the model produces high-quality, well-aligned images on its own, without CFG. Re-enabling CFG at that point mainly reduces diversity and introduces the characteristic high-contrast look, with debatable benefit. Interestingly, our method reaches similar reward levels regardless of CFG scale.
Independent metrics support these findings; OneIG-Bench results agree with the qualitative improvements described above.
FDFO came from taking a step back and thinking about the structure of flow models rather than contorting them to existing RL machinery. A key realization was that policy gradient methods use the gradient of the log-likelihood, not the likelihood itself. Letting go of the need to estimate likelihoods freed us to think about the problem in a more flow-native way. The algorithm we landed on is simple: generate paired trajectories from shared noise, compute reward-weighted image differences, and apply them uniformly as velocity targets across all steps. We see it as an important step toward better RL post-training for image generators.
Important questions remain. Retaining diversity during RL is an open problem, and we find that KL regularization is not a satisfactory solution. Aspects of image quality not covered by the reward are free to degenerate over time, so we must also design rewards comprehensive enough to avoid the need for early stopping. We see FDFO as a valuable tool for continued work on these problems.