Enhances mathematical reasoning in LLMs by integrating Group Relative Policy Optimization (GRPO) with a specific reflection reward mechanism.
March 17, 2026
Original Paper
GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models
arXiv · 2603.14041
The Takeaway
While DeepSeek-R1 popularized GRPO, this paper provides a concrete recipe for encouraging proactive self-reflection during training, rather than merely enforcing a reflection format in the output. It demonstrates that cognitive rewards for internal reflection significantly boost performance over standard RLHF and SFT approaches.
From the abstract
The enhancement of reasoning capabilities in large language models (LLMs) has garnered significant attention, with supervised fine-tuning (SFT) and reinforcement learning emerging as dominant paradigms. While recent studies recognize the importance of reflection in reasoning processes, existing methodologies seldom address proactive reflection encouragement during training. This study focuses on mathematical reasoning by proposing a four-stage framework integrating Group Relative Policy Optimization […]
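To make the core idea concrete, here is a minimal sketch of how a GRPO-style group-relative advantage could be combined with a reflection bonus in the reward. The marker words, bonus weight, and `reward` function are illustrative assumptions for this digest, not the paper's actual recipe; only the group-relative normalization (reward minus group mean, divided by group standard deviation, with no learned critic) is standard GRPO.

```python
import re

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's reward
    against the mean and std of its own group (no learned value critic)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def reward(completion, correct_answer):
    """Hypothetical composite reward: correctness plus a small bonus
    when the completion contains explicit self-reflection.
    Marker phrases and the 0.2 weight are assumptions, not the paper's."""
    r = 1.0 if correct_answer in completion else 0.0
    if re.search(r"\b(wait|let me check|re-examine|verify)\b", completion.lower()):
        r += 0.2  # reflection bonus (assumed weight)
    return r

# Toy group of 4 sampled completions for one math prompt
group = [
    "The answer is 42.",
    "Let me check: 6*7 = 42, so the answer is 42.",
    "The answer is 41.",
    "Wait, re-examine: it should be 42.",
]
rewards = [reward(c, "42") for c in group]
advantages = group_relative_advantages(rewards)
```

Under this sketch, a correct completion that also reflects earns a higher reward (1.2 vs. 1.0) and therefore a larger group-relative advantage, so the policy gradient pushes the model toward proactive reflection rather than toward format compliance alone.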