AI & ML New Capability

Introduces Reward Sharpness-Aware Fine-Tuning (RSA-FT) to mitigate reward hacking in diffusion models without retraining reward models.

March 24, 2026

Original Paper

Reward Sharpness-Aware Fine-Tuning for Diffusion Models

Kwanyoung Kim, Byeongsu Sim

arXiv · 2603.21175

The Takeaway

It makes image-based RLHF significantly more robust. By exploiting gradients from a 'flattened' reward landscape, it prevents the model from generating high-scoring but perceptually poor images, a common failure mode in current reward-centric diffusion fine-tuning.

From the abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises