AI & ML New Capability

Develops a differentially private RLHF pipeline that decouples private reward learning from policy optimization, achieving strong alignment on Gemma-2B-IT under formal differential privacy guarantees.

March 25, 2026

Original Paper

Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling

Young Hyun Cho, Will Wei Sun

arXiv · 2603.22563

The Takeaway

RLHF often uses sensitive human preference data; this paper provides a practical way to ensure the resulting model doesn't leak that information. The decoupled approach is significant because it allows the computationally expensive RL phase to remain non-private while protecting the sensitive data-dependent reward model.
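The decoupling described above can be illustrated with a minimal sketch (not the authors' code, and a linear toy model rather than an LLM): the reward model is trained on pairwise preferences with DP-SGD-style per-example gradient clipping plus Gaussian noise, and any downstream policy optimization then only touches the already-private reward function, so it needs no further noise. All function names and hyperparameters here are illustrative assumptions.

```python
# Hedged sketch of decoupled private reward learning: DP-SGD-style updates
# (per-example clipping + Gaussian noise) on a toy linear Bradley-Terry
# reward model. Downstream RL would query reward(x) = w @ x and, because
# privacy is imposed only at this stage, would run without extra noise.
import numpy as np

rng = np.random.default_rng(0)

def dp_reward_learning(prefs, dim, steps=200, lr=0.1, clip=1.0, sigma=0.5):
    """prefs: list of (x_chosen, x_rejected) feature pairs.
    clip: per-example gradient L2 clipping bound; sigma: noise multiplier.
    (Hyperparameters are illustrative, not from the paper.)"""
    w = np.zeros(dim)
    for _ in range(steps):
        clipped = []
        for x_chosen, x_rejected in prefs:
            d = x_chosen - x_rejected
            # Bradley-Terry preference loss: -log sigmoid(w @ d)
            p = 1.0 / (1.0 + np.exp(w @ d))
            g = -p * d                       # per-example gradient
            norm = np.linalg.norm(g)
            # clip each example's gradient to bound its influence
            clipped.append(g * min(1.0, clip / norm) if norm > 0 else g)
        # Gaussian noise calibrated to the clipping bound
        noise = rng.normal(0.0, sigma * clip, size=dim)
        w -= lr * (np.sum(clipped, axis=0) + noise) / len(prefs)
    return w

# toy data: the "chosen" response is shifted along the first feature axis
prefs = [(rng.normal(size=4) + np.array([1.0, 0.0, 0.0, 0.0]),
          rng.normal(size=4)) for _ in range(64)]
w = dp_reward_learning(prefs, dim=4)
```

The learned `w` assigns higher reward along the preferred direction; the key structural point is that the sensitive preference pairs are consumed only inside the clipped-and-noised loop, never by the policy optimizer.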

From the abstract

Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model.