AI & ML Efficiency Breakthrough

Integrates fast scalar rewards with slow generative chain-of-thought (CoT) reasoning, cutting reward-model token consumption by 20%.

March 24, 2026

Original Paper

Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

Jiayun Wu, Peixu Hou, Shan Qu, Peng Zhang, Ning Gu, Tun Lu

arXiv · 2603.20212

The Takeaway

Generative reward models (GRMs) are accurate but expensive. This "Fast-Slow" hybrid architecture invokes costly reasoning only when necessary, significantly lowering the compute cost of RLHF and alignment while improving accuracy.
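The core idea can be sketched as a router: score every response with the cheap scalar reward model first, and fall back to the expensive CoT-based generative model only when the fast path is not confident. This is a minimal illustrative sketch, not the paper's exact gating criterion; the confidence-threshold rule, the toy models, and all function names (`scalar_rm`, `generative_rm`, `hybrid_reward`) are assumptions for demonstration.

```python
# Illustrative fast/slow reward-model router. The gating rule
# (scalar confidence vs. a threshold) is an assumed mechanism,
# not necessarily the one used in F/S-RM.

def scalar_rm(prompt: str, response: str) -> tuple[float, float]:
    """Toy scalar RM: returns (reward, confidence). Placeholder heuristic,
    standing in for a real learned scalar head."""
    reward = min(len(response) / 100.0, 1.0)
    confidence = 0.95 if len(response) < 50 else 0.5
    return reward, confidence

def generative_rm(prompt: str, response: str) -> float:
    """Toy generative RM: a real one would generate a chain-of-thought
    critique and parse a final score, spending many extra tokens."""
    return 0.8

def hybrid_reward(prompt: str, response: str, threshold: float = 0.9) -> float:
    """Fast path when the scalar RM is confident; slow CoT path otherwise."""
    reward, confidence = scalar_rm(prompt, response)
    if confidence >= threshold:
        return reward                       # fast: no reasoning tokens spent
    return generative_rm(prompt, response)  # slow: CoT reasoning invoked

print(hybrid_reward("q", "short answer"))  # confident scalar -> fast path
print(hybrid_reward("q", "a" * 80))        # low confidence -> slow path
```

Because most responses take the fast path, the average token cost per reward stays close to the scalar model's, which is where the overall savings come from.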

From the abstract

Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dua…