AI & ML Practical Magic

The standard best practice for training the world's most powerful AI models actually hurts their performance in certain situations.

April 23, 2026

Original Paper

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

arXiv · 2604.19485

The Takeaway

Critic-based RLHF relies on a learned value model that, when rewards are sparse, can inject more estimation noise than it removes, dragging down results. EVPO resolves this by switching between critic-based and critic-free training on the fly, based on how much of the return variance the critic actually explains. The finding addresses a tension at the heart of LLM post-training: more complex training setups are sometimes outperformed by simpler ones when that noise goes unmanaged. The result is a more stable and efficient way to teach AI models how to behave.
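A minimal sketch of what such a switch might look like, assuming explained variance as the criterion. The function names, the 0.1 threshold, and the exact advantage formulas here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def explained_variance(values, returns):
    """Fraction of return variance the critic explains: 1 - Var(R - V) / Var(R)."""
    var_returns = np.var(returns)
    if var_returns == 0:
        return 0.0
    return 1.0 - np.var(returns - values) / var_returns

def advantages(values, returns, ev_threshold=0.1):
    """Use the critic's baseline only when it explains enough variance;
    otherwise fall back to a GRPO-style group-mean baseline."""
    if explained_variance(values, returns) > ev_threshold:
        return returns - values          # critic-based (PPO-style) baseline
    return returns - returns.mean()      # critic-free (GRPO-style) baseline
```

A perfect critic yields an explained variance of 1 and keeps the critic path; a critic no better than noise scores at or below 0 and triggers the critic-free fallback.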

From the abstract

Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing