AI & ML Paradigm Shift

Introduces Centered Reward Distillation (CRD) to stabilize diffusion reinforcement learning by removing intractable normalizing constants.

March 17, 2026

Original Paper

Diffusion Reinforcement Learning via Centered Reward Distillation

Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton

arXiv · 2603.14128

The Takeaway

RL fine-tuning of diffusion models (e.g., aligning Stable Diffusion to human preferences) is notoriously brittle and prone to reward hacking. CRD offers a more stable mathematical framework for reward matching that prevents distribution drift, making it easier to align generative models to complex black-box rewards.
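To see why centering matters, consider the standard KL-regularized alignment target. This is a plausible reading of "removing intractable normalizing constants," not necessarily the paper's exact formulation, and the temperature \beta is an assumed hyperparameter. The reward-tilted target is

p^*(x) = \frac{1}{Z}\, p_{\mathrm{ref}}(x)\, e^{r(x)/\beta}, \qquad Z = \mathbb{E}_{x \sim p_{\mathrm{ref}}}\!\left[ e^{r(x)/\beta} \right],

so matching log-densities requires the intractable log-partition term:

\log p_\theta(x) - \log p_{\mathrm{ref}}(x) = \frac{r(x)}{\beta} - \log Z.

Because \log Z is the same constant for every sample, subtracting batch means from both sides cancels it. Writing \Delta(x) := \log p_\theta(x) - \log p_{\mathrm{ref}}(x) over a batch \{x_i\}_{i=1}^n:

\Delta(x_i) - \frac{1}{n}\sum_{j} \Delta(x_j) = \frac{1}{\beta}\left( r(x_i) - \frac{1}{n}\sum_{j} r(x_j) \right).

A regression loss on these centered quantities never evaluates Z at all, which is what makes fitting to black-box rewards tractable.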

From the abstract

Diffusion and flow models achieve state-of-the-art (SOTA) generative performance, yet many practically important behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering are weakly specified by score or flow matching pretraining objectives. Reinforcement Learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward […]
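As a concrete illustration, here is a minimal PyTorch-style sketch of a batch-centered reward-matching loss under the formulation above. The function name, the beta value, and the use of a per-sample log-density ratio as input are all assumptions for illustration, not details from the paper.

```python
import torch

def centered_reward_loss(log_ratio: torch.Tensor,
                         rewards: torch.Tensor,
                         beta: float = 1.0) -> torch.Tensor:
    """Illustrative batch-centered reward-matching loss (a sketch, not the paper's method).

    log_ratio : log p_theta(x_i) - log p_ref(x_i) per sample (hypothetical input)
    rewards   : black-box reward r(x_i) per sample
    beta      : KL temperature (assumed hyperparameter)

    Centering both sides within the batch cancels the intractable
    log-partition term log Z, which is constant across samples.
    """
    centered_ratio = log_ratio - log_ratio.mean()
    centered_reward = (rewards - rewards.mean()) / beta
    # Regress centered log-ratios onto centered (scaled) rewards.
    return torch.mean((centered_ratio - centered_reward) ** 2)

# Dummy usage: 8 samples with random log-ratios and rewards.
log_ratio = torch.randn(8, requires_grad=True)
rewards = torch.randn(8)
loss = centered_reward_loss(log_ratio, rewards, beta=0.1)
loss.backward()  # gradients flow only through log_ratio
```

The design point: because the loss only ever sees within-batch differences, no estimate of the partition function Z is needed, so the reward can stay a black box.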