Introduces Centered Reward Distillation (CRD) to stabilize diffusion reinforcement learning by removing intractable normalizing constants.
March 17, 2026
Original Paper
Diffusion Reinforcement Learning via Centered Reward Distillation
arXiv · 2603.14128
The Takeaway
RL fine-tuning for diffusion models (e.g., aligning Stable Diffusion to human preferences) is notoriously brittle and prone to reward hacking. CRD offers a more stable reward-matching framework that prevents distribution drift, making it easier to align generative models to complex black-box rewards.
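To see why removing the normalizing constant helps, here is a minimal sketch of the standard reward-tilted target construction that reward-matching objectives like CRD build on; the paper's exact formulation may differ, and the symbols $\beta$, $p_\text{ref}$, and $Z$ are notation assumed here rather than taken from the excerpt. The aligned target tilts the reference model by the reward:

$$p^*(x) = \frac{1}{Z}\, p_\text{ref}(x)\, \exp\!\big(r(x)/\beta\big), \qquad Z = \mathbb{E}_{x \sim p_\text{ref}}\!\big[\exp\!\big(r(x)/\beta\big)\big].$$

Matching $\log p^*$ directly requires the intractable constant $\log Z$, since $\log p^*(x) - \log p_\text{ref}(x) = r(x)/\beta - \log Z$. But subtracting the batch mean from both sides cancels it:

$$\big(\log p^*(x) - \log p_\text{ref}(x)\big) - \overline{\big(\log p^* - \log p_\text{ref}\big)} = \frac{r(x) - \bar r}{\beta},$$

so a centered regression target depends only on reward differences within a batch, never on $Z$.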
From the abstract
Diffusion and flow models achieve state-of-the-art (SOTA) generative performance, yet many practically important behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering are weakly specified by score- or flow-matching pretraining objectives. Reinforcement Learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward…