The standard best practice for training the world's most powerful AI models actually hurts their performance in certain situations.
April 23, 2026
Original Paper
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
arXiv · 2604.19485
The Takeaway
PPO-style RLHF relies on a learned critic model, but when rewards are sparse the critic can inject more estimation noise than the signal it provides, making training worse. EVPO resolves this by monitoring the critic's quality during training and automatically switching between critic-based and critic-free updates. This addresses a fundamental tension that has shaped the training of every major LLM: the researchers found that more complex training setups are sometimes outperformed by simpler ones when critic noise goes unmanaged. The result is a more stable and efficient way to teach AI models how to behave.
From the abstract
Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing
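The adaptive switch described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the threshold value are hypothetical, but the explained-variance formula (EV = 1 − Var(returns − values) / Var(returns)) is the standard definition, and the two fallback baselines mirror PPO-style (critic) and GRPO-style (group mean) advantages mentioned in the abstract.

```python
import numpy as np

def explained_variance(values, returns):
    """Fraction of return variance the critic accounts for:
    EV = 1 - Var(returns - values) / Var(returns).
    Near 1 means the critic tracks returns well; <= 0 means it adds noise."""
    var_ret = np.var(returns)
    if var_ret == 0:
        return 0.0
    return 1.0 - np.var(returns - values) / var_ret

def select_advantages(values, returns, ev_threshold=0.1):
    """Hypothetical EVPO-style switch (threshold is an illustrative choice):
    use the critic baseline when it explains enough return variance,
    otherwise fall back to a critic-free group-mean baseline."""
    ev = explained_variance(values, returns)
    if ev > ev_threshold:
        adv = returns - values          # critic-based (PPO-style) advantage
    else:
        adv = returns - returns.mean()  # critic-free (GRPO-style) advantage
    return adv, ev
```

With a critic whose predictions track the returns, the switch keeps the critic baseline; with a noisy critic, explained variance collapses and the update falls back to the group-mean baseline, which is the failure mode the paper identifies in sparse-reward settings.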