Introduces Reward Sharpness-Aware Fine-Tuning (RSA-FT) to mitigate reward hacking in diffusion models without retraining reward models.
March 24, 2026
Original Paper
Reward Sharpness-Aware Fine-Tuning for Diffusion Models
arXiv · 2603.21175
The Takeaway
RSA-FT makes image-based RLHF significantly more robust. By fine-tuning on gradients from a 'flattened' reward landscape, it prevents the model from generating images that score highly under the reward model yet look perceptually poor, a common failure mode in reward-centric diffusion RL.
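To make the flattened-landscape idea concrete, here is a minimal PyTorch sketch of the general sharpness-aware trick, not the paper's exact algorithm: the function name, the perturbation radius `rho`, and the choice to perturb in image space rather than reward-model weight space are all assumptions. The reward used for fine-tuning is evaluated at the worst point of a small ball around each sample, so sharp, hackable reward peaks are discounted while broad maxima survive.

```python
import torch
import torch.nn as nn

def sharpness_aware_reward(reward_model, images, rho=0.05):
    # Probe pass: gradient of the reward w.r.t. the images themselves.
    probe = images.detach().requires_grad_(True)
    grad = torch.autograd.grad(reward_model(probe).sum(), probe)[0]

    # Step *against* the reward gradient: the worst point in a rho-ball.
    # A sharp (hacked) peak loses most of its reward here; a broad,
    # robust maximum barely changes.
    norms = grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
    eps = -rho * grad / (norms + 1e-12)

    # Flattened reward: still differentiable through `images`, so its
    # gradient can drive the diffusion fine-tuning update upstream.
    return reward_model(images + eps)

if __name__ == "__main__":
    # Stand-ins: a linear reward head and random "generated" images.
    reward_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))
    images = torch.randn(4, 3, 8, 8, requires_grad=True)
    r = sharpness_aware_reward(reward_model, images)
    r.sum().backward()  # gradients flow to whatever produced `images`
    print(r.shape, images.grad is not None)
```

Maximizing this perturbed reward instead of the raw one is the standard sharpness-aware move; the paper's actual formulation may instead perturb the reward model's weights or operate inside the diffusion sampling loop.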
From the abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises …