Introduces Process-Aware Policy Optimization (PAPO) to solve the chronic issue of reward hacking in process reward models (PRMs).
March 30, 2026
Original Paper
Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
arXiv · 2603.26535
The Takeaway
PAPO decouples outcome and process rewards, preventing models from inflating scores through verbosity while ensuring reasoning quality actually improves. This provides a stable path for scaling reasoning models beyond simple outcome-based RLHF.
From the abstract
We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer …
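
The excerpt doesn't show the paper's exact formulation, but a minimal sketch of what decoupled advantage normalization could look like on top of GRPO is below, assuming each reward stream is z-scored within its group separately before being combined. The function name `decoupled_advantages` and the weighting coefficient `beta` are hypothetical, not from the paper:

```python
import numpy as np

def decoupled_advantages(outcome_rewards, process_rewards, beta=0.5, eps=1e-8):
    """Sketch of decoupled advantage normalization for one GRPO group.

    Each reward stream is normalized within the group *separately*, so a
    verbose response cannot inflate its total advantage by stacking raw
    process reward on top of an already-correct outcome: its process term
    is scored only relative to the other responses' process quality.
    `beta` (hypothetical) weights the process term against the outcome term.
    """
    o = np.asarray(outcome_rewards, dtype=float)
    p = np.asarray(process_rewards, dtype=float)

    # GRPO-style group normalization, applied to each stream independently.
    a_outcome = (o - o.mean()) / (o.std() + eps)
    a_process = (p - p.mean()) / (p.std() + eps)

    # Combined advantage used in the policy-gradient update.
    return a_outcome + beta * a_process

# Example: a group of 4 sampled responses to one prompt.
outcome = [1.0, 1.0, 0.0, 1.0]   # binary final-answer correctness
process = [0.9, 0.4, 0.7, 0.8]   # PRM scores for reasoning quality
print(decoupled_advantages(outcome, process))
```

Under this reading, when a group becomes uniformly correct the outcome stream has zero variance and its normalized advantage collapses to zero, yet the separately normalized process stream still differentiates responses, which is exactly the vanishing-signal failure mode the abstract attributes to pure ORM-based training.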