Introduces Process-Aware Policy Optimization (PAPO) to solve the chronic issue of reward hacking in process reward models (PRMs).
March 30, 2026
Original Paper
Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
arXiv · 2603.26535
The Takeaway
PAPO decouples outcome and process rewards, preventing models from inflating scores through verbosity while ensuring reasoning quality actually improves. This provides a stable path for scaling reasoning models beyond simple outcome-based RLHF.
From the abstract
We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer …
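
The excerpt doesn't show the paper's exact formulation, but a minimal sketch of what decoupled advantage normalization could look like on top of GRPO is below, assuming each reward stream is z-scored within its group separately before being combined. The function name `decoupled_advantages` and the weighting coefficient `beta` are hypothetical, not from the paper:

```python
import numpy as np

def decoupled_advantages(outcome_rewards, process_rewards, beta=0.5, eps=1e-8):
    """Sketch of decoupled advantage normalization for one GRPO group.

    Each reward stream is normalized within the group *separately*, so a
    verbose response cannot inflate its total advantage by stacking raw
    process reward on top of an already-correct outcome: its process term
    is scored only relative to the other responses' process quality.
    `beta` (hypothetical) weights the process term against the outcome term.
    """
    o = np.asarray(outcome_rewards, dtype=float)
    p = np.asarray(process_rewards, dtype=float)

    # GRPO-style group normalization, applied to each stream independently.
    a_outcome = (o - o.mean()) / (o.std() + eps)
    a_process = (p - p.mean()) / (p.std() + eps)

    # Combined advantage used in the policy-gradient update.
    return a_outcome + beta * a_process

# Example: a group of 4 sampled responses to one prompt.
outcome = [1.0, 1.0, 0.0, 1.0]   # binary final-answer correctness
process = [0.9, 0.4, 0.7, 0.8]   # PRM scores for reasoning quality
print(decoupled_advantages(outcome, process))
```

Under this reading, when a group becomes uniformly correct the outcome stream has zero variance and its normalized advantage collapses to zero, yet the separately normalized process stream still differentiates responses, which is exactly the vanishing-signal failure mode the abstract attributes to pure ORM-based training.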