Dissecting AI

Problem & Motivation

This paper addresses a fundamental limitation in reinforcement learning (RL) fine-tuning for flow matching models, a dominant framework for text-to-image and text-to-video generation. Flow matching models (Lipman et al., 2023; Liu et al., 2023) learn a continuous-time velocity field that transports samples from a simple source distribution (e.g., Gaussian noise) to a complex data distribution (e.g., images). While these models achieve high-quality generation, aligning them with downstream objectives (e.g., compositional accuracy, aesthetic quality) remains challenging. Prior work, such as Flow-GRPO (Liu et al., 2025) and Flow-CPS (Wang and Yu, 2025), casts the denoising process as a Markov Decision Process (MDP) and applies Proximal Policy Optimization (PPO)-style ratio clipping to enforce trust-region constraints. However, the authors argue that ratio clipping is structurally ill-suited for flow models due to its reliance on a noisy, single-sample estimate of the true policy divergence.

The core issue arises from the high-dimensional, continuous latent space of flow models. In this setting, the probability ratio between new and old policies is dominated by sampling noise, leading to a systematic bias: the mean of the ratio distribution falls below one, causing asymmetric clipping behavior. Specifically, the standard PPO clipping range $[1 - \epsilon, 1 + \epsilon]$ becomes effectively asymmetric, failing to adequately constrain over-optimization for positive-advantage samples while excessively clipping negative-advantage ones. This results in over-constraining in some regions of the trajectory and under-constraining in others, degrading both reward optimization and generalization. The authors demonstrate that this pathology is intrinsic to Gaussian policies, where the log-ratio is corrupted by a noise term whose standard deviation is of the same order as the signal itself. The research question is thus: Can we replace ratio clipping with a deterministic, exact divergence constraint to achieve more stable and efficient RL fine-tuning for flow models?
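The noise in the single-sample ratio can be illustrated with a toy simulation. This is a hedged sketch rather than the paper's experiment: it assumes two isotropic Gaussian policies that share a standard deviation `sigma` and whose means differ by a constant `offset` in every coordinate (the dimension and values are illustrative). Under that assumption, for an action drawn from the old policy the log-ratio is itself Gaussian with mean $-D_{\text{KL}}$ and standard deviation $\sqrt{2 D_{\text{KL}}}$, so the sampling noise is at least comparable to, and for small divergences larger than, the quantity a trust region is meant to bound.

```python
import math
import random

def log_ratio_samples(dim, sigma, offset, n, seed=0):
    """Sample log(pi_new(a) / pi_old(a)) for a ~ pi_old (isotropic Gaussians).

    The two means differ by `offset` in every coordinate (hypothetical toy setup).
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        log_r = 0.0
        for _ in range(dim):
            noise = sigma * rng.gauss(0.0, 1.0)  # a - mu_old for this coordinate
            # log N(a; mu_new) - log N(a; mu_old); normalizers cancel
            log_r += (noise ** 2 - (noise - offset) ** 2) / (2 * sigma ** 2)
        out.append(log_r)
    return out

dim, sigma, offset = 256, 1.0, 0.01
kl = dim * offset ** 2 / (2 * sigma ** 2)  # exact KL for shared variance
samples = log_ratio_samples(dim, sigma, offset, n=2000)
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"exact KL={kl:.4f}  mean log-ratio={mean:.4f}  std log-ratio={std:.4f}")
```

In this toy setting the exact divergence is a single deterministic number, while any individual ratio sample scatters widely around it.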

Architecture & Method

The proposed method, Flow-DPPO (Flow Divergence Proximal Policy Optimization), replaces the noisy ratio clipping mechanism in Flow-GRPO with a divergence-based mask that enforces a trust region using the exact Kullback-Leibler (KL) divergence between old and new policies. The key insight is that the per-step policy in flow models is Gaussian, so the KL divergence can be computed exactly and cheaply in closed form. Specifically, for Gaussian policies with fixed variance $\sigma^2(t)$ and means $\bm{\mu}_{\theta_{\text{old}}}$ and $\bm{\mu}_{\theta}$, the KL divergence simplifies to:

$$D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid \bm{x}_t) \| \pi_{\theta}(\cdot \mid \bm{x}_t)\big) = \frac{\|\bm{\mu}_{\theta_{\text{old}}}(\bm{x}_t, t) - \bm{\mu}_{\theta}(\bm{x}_t, t)\|^2}{2\sigma^2(t)}.$$
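As a minimal sketch of this closed form, the helper below evaluates the KL divergence between two isotropic Gaussians that share a standard deviation. The function name and the plain-list representation of the means are assumptions made for illustration; a real implementation would operate on the latent tensors produced by the model.

```python
def gaussian_kl_shared_var(mu_old, mu_new, sigma):
    """KL( N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I) ) = ||mu_old - mu_new||^2 / (2 sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if len(mu_old) != len(mu_new):
        raise ValueError("means must have the same dimension")
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)

# Squared distance is 0.25 and 2 * sigma^2 is 0.5, so the result is close to 0.5.
example_kl = gaussian_kl_shared_var([0.0, 0.0], [0.3, 0.4], sigma=0.5)
print(example_kl)
```

Because both policies share the same covariance, swapping the two arguments gives the same value, and no sampled action enters the computation.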

This divergence is a deterministic function of the policy parameters and can be computed from two forward passes of the velocity network, incurring no additional computational cost beyond what is already required for ratio computation. The method supports two sampling schedules: Flow-SDE and Coefficients-Preserving Sampling (CPS), both of which induce Gaussian per-step policies. For Flow-SDE, the KL divergence is:

$$D_{\text{KL}}^{\text{SDE}}(\pi_{\theta_{\text{old}}} \| \pi_{\theta}) = \frac{\Delta t}{2}\left(\frac{\sigma_t(1 - t)}{2t} + \frac{1}{\sigma_t}\right)^2 \|\bm{v}_{\theta}(\bm{x}_t, t) - \bm{v}_{\theta_{\text{old}}}(\bm{x}_t, t)\|^2,$$

where $\bm{v}_{\theta}$ is the velocity network. For CPS, the divergence is:

$$D_{\text{KL}}^{\text{CPS}}(\pi_{\theta_{\text{old}}} \| \pi_{\theta}) = \frac{\|\bm{\mu}_{\theta}^{\text{CPS}}(\bm{x}_t, t) - \bm{\mu}_{\theta_{\text{old}}}^{\text{CPS}}(\bm{x}_t, t)\|^2}{2(t - \Delta t)^2 \sin^2(\eta \pi / 2)}.$$
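The two schedule-specific expressions can be transcribed directly. The sketch below is an illustration under stated assumptions, not the authors' code: the function names, the list-based vectors, and the input checks are hypothetical, and the CPS helper takes the per-step means as given rather than deriving them from the velocity network.

```python
import math

def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flow_sde_kl(v_old, v_new, t, dt, sigma_t):
    """Per-step KL for Flow-SDE, written in terms of the two velocity predictions."""
    if t <= 0.0 or dt <= 0.0 or sigma_t <= 0.0:
        raise ValueError("t, dt and sigma_t must be positive")
    scale = sigma_t * (1.0 - t) / (2.0 * t) + 1.0 / sigma_t
    return 0.5 * dt * scale ** 2 * _sq_dist(v_new, v_old)

def flow_cps_kl(mu_old, mu_new, t, dt, eta):
    """Per-step KL for CPS, written in terms of the two per-step means."""
    # Reading the denominator as 2 * std^2, the implied per-step std is:
    std = (t - dt) * math.sin(eta * math.pi / 2.0)
    if std == 0.0:
        raise ValueError("degenerate step: implied standard deviation is zero")
    return _sq_dist(mu_new, mu_old) / (2.0 * std ** 2)

sde_example = flow_sde_kl([0.0, 0.0], [1.0, 0.0], t=1.0, dt=0.1, sigma_t=1.0)
cps_example = flow_cps_kl([0.0, 0.0], [0.3, 0.4], t=1.0, dt=0.5, eta=1.0)
print(sde_example, cps_example)
```

Both helpers depend only on the network outputs and the schedule constants, which is what makes the divergence deterministic for a given state.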

The Flow-DPPO mask is defined as:

$$M_t^i = \begin{cases} 0, & \text{if } (\hat{A}^i > 0 \text{ and } r_t^i > 1 \text{ and } D_t > \delta) \\ & \text{or } (\hat{A}^i < 0 \text{ and } r_t^i < 1 \text{ and } D_t > \delta), \\ 1, & \text{otherwise}, \end{cases}$$

where $D_t = D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot \mid \bm{x}_t^i) \| \pi_{\theta}(\cdot \mid \bm{x}_t^i))$, $\hat{A}^i$ is the normalized advantage, and $r_t^i$ is the importance ratio. The mask preserves the asymmetric structure of PPO: it blocks an update only when the ratio has already moved in the direction the advantage favors (so a further step would push the policy away from the old policy) and the divergence exceeds the threshold $\delta$. Corrective updates, which move the policy back toward the old policy, are therefore never blocked, which accelerates recovery from overshooting. The objective function for Flow-DPPO is:

$$\mathcal{L}^{\text{Flow-DPPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^G \frac{1}{T}\sum_{t=0}^{T-1} \left(M_t^i \cdot r_t^i(\theta) \cdot \hat{A}^i - \beta D_{\text{KL}}(\pi_{\theta} \| \pi_{\text{ref}})\right)\right],$$

where $G$ is the group size, $T$ is the number of denoising steps, and $\beta$ controls the strength of the KL regularization term to prevent catastrophic forgetting. The method is theoretically justified by a policy improvement bound adapted to the finite-horizon, undiscounted MDP setting of flow models. The bound guarantees monotonic improvement when the per-step divergence is constrained, providing a rigorous foundation for the divergence-based mask.
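The mask and the objective can be sketched together on plain numbers. The code below is an illustrative assumption-laden sketch: the function names and nested-list layout (`[G][T]` for per-step quantities, `[G]` for advantages) are hypothetical, the expectation is replaced by a single batch average, and a real training loop would use tensors in an autodiff framework instead of Python floats.

```python
def dppo_mask(advantage, ratio, kl, delta):
    """Return 0 when the update moves further from the old policy and KL exceeds delta."""
    moving_away = (advantage > 0 and ratio > 1) or (advantage < 0 and ratio < 1)
    return 0.0 if (moving_away and kl > delta) else 1.0

def flow_dppo_objective(ratios, advantages, step_kl, ref_kl, delta, beta):
    """Batch estimate of the objective: mean over groups and steps of M*r*A - beta*KL_ref."""
    total = 0.0
    for i, adv in enumerate(advantages):
        num_steps = len(ratios[i])
        traj = 0.0
        for t in range(num_steps):
            mask = dppo_mask(adv, ratios[i][t], step_kl[i][t], delta)
            traj += mask * ratios[i][t] * adv - beta * ref_kl[i][t]
        total += traj / num_steps
    return total / len(advantages)

ratios = [[1.2, 0.9], [0.8, 1.1]]
advantages = [1.0, -1.0]
step_kl = [[1e-6, 1e-6], [1e-6, 1e-8]]
ref_kl = [[0.1, 0.1], [0.1, 0.1]]
value = flow_dppo_objective(ratios, advantages, step_kl, ref_kl, delta=1e-7, beta=1e-3)
print(value)
```

In this toy batch, the first step of each trajectory is masked because the ratio has moved in the advantage-favored direction and the divergence exceeds `delta`; the second steps are corrective and stay active regardless of their divergence.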

Training & Implementation

Flow-DPPO is trained on GenEval2 (Kamath et al., 2025), a benchmark for evaluating compositional accuracy in text-to-image generation, and evaluated with PickScore (Kirstain et al., 2023), which serves as an out-of-domain (OOD) measure of generalization. The training uses 300B tokens of synthetic prompts generated from the GenEval2 template. The models are fine-tuned using LoRA (Low-Rank Adaptation) with ranks $r=32$ and $r=64$ for Stable Diffusion 3.5 (SD3.5) and FLUX2-9B/FLUX.1-dev, respectively. The AdamW optimizer is employed with a learning rate of $3 \times 10^{-4}$, following a cosine schedule with 2000 warmup steps. The batch size is 512 sequences of length 2048 tokens (approximately 1M tokens per batch). Training is conducted on NVIDIA H20 96GB GPUs, with a total computational cost of approximately 140K GPU hours for all experiments. The divergence threshold $\delta$ is set to $1 \times 10^{-7}$ for Flow-DPPO and $1 \times 10^{-6}$ for Flow-DPPO+CPS, reflecting their different KL-scaling factors. The group size for GRPO is 16, with 64 groups per epoch. The stochasticity parameter $\eta$ for CPS sampling is set to 0.8. The method also incorporates a mixed ODE-SDE sampling strategy (Li et al., 2025) to accelerate convergence: 3 steps are randomly selected from the first half of the denoising steps for SDE sampling, and the remaining steps use ODE sampling.
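The mixed ODE-SDE step selection can be sketched as follows. This is a guess at one straightforward reading of the description, not the authors' implementation: the function name, the `num_steps` value, and the use of uniform sampling without replacement are assumptions, since the text does not state the total number of denoising steps or the exact sampling rule.

```python
import random

def pick_sde_steps(num_steps, num_sde=3, seed=None):
    """Choose `num_sde` distinct step indices from the first half of the trajectory."""
    first_half = range(num_steps // 2)
    if num_sde > len(first_half):
        raise ValueError("not enough steps in the first half")
    rng = random.Random(seed)
    return sorted(rng.sample(first_half, num_sde))

num_steps = 10  # hypothetical; the source does not give the step count
sde_steps = pick_sde_steps(num_steps, seed=0)
schedule = ["SDE" if t in sde_steps else "ODE" for t in range(num_steps)]
print(sde_steps, schedule)
```

Only the selected steps inject noise and therefore define a Gaussian per-step policy; the remaining steps are deterministic under this reading.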

Results & Comparisons

Flow-DPPO achieves state-of-the-art performance across multiple base models (SD3.5, FLUX2-9B, FLUX.1-dev) and evaluation metrics. Key quantitative results include:

GenEval2 (In-Domain): Flow-DPPO+CPS achieves 51.6% on SD3.5 and 57.7% on FLUX2-9B in the multi-reward setting, outperforming Flow-GRPO (39.9% on SD3.5, 46.8% on FLUX2-9B) and Flow-CPS (44.6% on SD3.5, 47.1% on FLUX2-9B). In the single-reward setting, Flow-DPPO+CPS reaches 84.1% on SD3.5 and 92.6% on FLUX2-9B, compared to 56.6% and 84.5% for Flow-GRPO.
Out-of-Domain Generalization (PickScore): Flow-DPPO+CPS achieves 22.97 PickScore on FLUX2-9B (multi-reward), surpassing Flow-GRPO (22.62) and Flow-CPS (22.85). On SD3.5, it scores 22.51 PickScore, outperforming Flow-GRPO (22.07) and Flow-CPS (22.08).
Catastrophic Forgetting Mitigation: Flow-DPPO variants maintain lower KL divergence from the pretrained model. For FLUX2-9B in the single-reward setting, Flow-DPPO achieves a KL divergence of $0.17 \times 10^{-3}$, compared to $0.77 \times 10^{-3}$ for Flow-GRPO and $1.07 \times 10^{-3}$ for GRPO-Guard. In the multi-reward setting, Flow-DPPO+CPS achieves $0.70 \times 10^{-3}$, outperforming Flow-CPS ($1.66 \times 10^{-3}$).
Multi-Epoch Training Stability: Flow-DPPO enables stable multi-epoch training, where baseline methods (Flow-GRPO, Flow-CPS) plateau or degrade. For example, with 64 groups and 2 inner loops (G64-I2), Flow-DPPO+CPS achieves sustained GenEval2 improvements on SD3.5, while Flow-GRPO degrades after 500 training epochs.
Ablation Studies: The asymmetric mask is critical for stability. Without it, training collapses due to ineffective trust-region regularization. A divergence threshold of $10^{-7}$ yields the best balance between stability and performance, while a looser threshold ($10^{-5}$) reduces stability. The KL regularization strength $\beta = 10^{-3}$ further mitigates reward hacking, improving final GenEval2 performance without degrading OOD metrics.

Limitations & Significance

The paper identifies several limitations of Flow-DPPO. First, the method relies on the Gaussian structure of flow model policies, which may not generalize to non-Gaussian or discrete action spaces. Second, while the divergence-based mask mitigates catastrophic forgetting, it does not entirely eliminate it, particularly under classifier-free guidance (CFG). Third, the method assumes that the velocity network can be fine-tuned without destabilizing the pretrained model, which may not hold for all architectures or tasks. Despite these limitations, Flow-DPPO represents a significant advancement in RL fine-tuning for flow models. By replacing ratio clipping with an exact, deterministic divergence constraint, the method achieves superior reward optimization, improved KL-proximal efficiency, and enhanced robustness to catastrophic forgetting. The ability to enable stable multi-epoch training is particularly impactful for scenarios with expensive rollouts, such as long-video generation. The authors suggest that the divergence-based mask could be extended to other generative models (e.g., diffusion models) and multi-modal settings, opening new avenues for research in RL-based alignment of generative models.

Flow DPPO: Divergence Proximal Policy Optimization for Flow Matching Models
