Integrates fast scalar rewards with slow generative CoT reasoning to reduce reward model token consumption by 20%.
March 24, 2026
Original Paper
Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models
arXiv · 2603.20212
The Takeaway
Generative reward models (GRMs) are accurate but expensive. This 'Fast-Slow' hybrid architecture activates the expensive reasoning path only when necessary, significantly lowering the compute cost of RLHF and alignment while improving accuracy.
From the abstract
Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dua
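The core routing idea described above, score cheaply first and escalate to generative CoT reasoning only on uncertain cases, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names (`fast_slow_reward`), the confidence-based trigger, and the threshold value are all assumptions for the sake of the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Reward:
    score: float          # final reward assigned to the response
    used_slow_path: bool  # True if the expensive generative RM was invoked

def fast_slow_reward(
    prompt: str,
    response: str,
    fast_rm: Callable[[str, str], Tuple[float, float]],  # scalar RM -> (score, confidence)
    slow_grm: Callable[[str, str], float],               # generative CoT RM -> score
    confidence_threshold: float = 0.8,  # assumed value; the paper may gate differently
) -> Reward:
    """Route to the expensive generative RM only when the scalar RM is unsure."""
    score, confidence = fast_rm(prompt, response)
    if confidence >= confidence_threshold:
        # Fast path: the cheap scalar score is confident enough to use directly.
        return Reward(score=score, used_slow_path=False)
    # Slow path: low confidence, so pay for chain-of-thought reasoning.
    return Reward(score=slow_grm(prompt, response), used_slow_path=True)
```

Because most preference pairs are easy, the slow path fires on only a fraction of inputs, which is where the token savings come from.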