DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu
We propose a new online reinforcement learning (RL) algorithm for diffusion and flow models based on the forward process.
Abstract
Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward–reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO reaches 0.95 only after more than 5k steps and with the additional use of CFG. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium on every benchmark tested.
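To make the idea of contrasting positive and negative generations on the forward process more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a linear (rectified-flow) forward path, a reward `r` already normalized to [0, 1], and a hypothetical mixing strength `beta`. The specific form of the positive and negative branches is an illustrative assumption that the abstract does not spell out.

```python
def forward_interpolate(x0, noise, t):
    """Linear forward process (assumed): x_t = (1 - t) * x0 + t * noise."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]


def target_velocity(x0, noise):
    """Flow-matching regression target for the linear path: noise - x0."""
    return [e - a for a, e in zip(x0, noise)]


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def contrastive_fm_loss(v_theta, v_old, v_target, r, beta=1.0):
    """Hypothetical reward-weighted contrast of two flow-matching terms.

    v_theta:  velocity predicted by the model being trained
    v_old:    velocity predicted by the frozen sampling policy
    v_target: flow-matching target computed from a clean sample
    r:        reward in [0, 1]; high values treat the sample as positive
    beta:     assumed strength of the implicit positive/negative branches
    """
    # Positive branch moves from the old prediction toward the new one.
    v_pos = [(1.0 - beta) * o + beta * p for o, p in zip(v_old, v_theta)]
    # Negative branch is the mirror image around the old prediction.
    v_neg = [(1.0 + beta) * o - beta * p for o, p in zip(v_old, v_theta)]
    return r * mse(v_pos, v_target) + (1.0 - r) * mse(v_neg, v_target)


# Toy usage: one clean 2-D "image", zero noise, and a model that already
# predicts the target velocity exactly.
x0, noise = [1.0, 2.0], [0.0, 0.0]
x_t = forward_interpolate(x0, noise, 0.5)
v_target = target_velocity(x0, noise)
v_old = [0.0, 0.0]
v_theta = list(v_target)

loss_if_positive = contrastive_fm_loss(v_theta, v_old, v_target, r=1.0)
loss_if_negative = contrastive_fm_loss(v_theta, v_old, v_target, r=0.0)
```

In this toy setup, fitting the target is rewarded when the sample is scored as positive and penalized when it is scored as negative, which is the sense in which a reward signal can be folded into a supervised flow-matching objective. Only the clean sample `x0` and fresh noise are needed; no sampling trajectory or likelihood appears anywhere in the loss.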
DiffusionNFT enables efficient online reinforcement learning for diffusion models via forward process optimization with up to 25x efficiency gains.
- DiffusionNFT paradigm for online RL on diffusion models using the forward process
- Eliminates likelihood estimation and reverse process solver restrictions
- Achieves up to 25x higher efficiency than FlowGRPO while outperforming CFG baselines
- Diffusion models
- Online reinforcement learning
- Flow matching
- Reward optimization
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Diffusion Models
- Reinforcement Learning
- Flow Matching