The 'consensus trap' in label-free RL—where models reinforce their own systematic errors—can be broken by co-evolving the model in alternating generator and verifier roles.
March 19, 2026
Original Paper
CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution
arXiv · 2603.17775
The Takeaway
Instead of relying on majority voting (which often collapses diversity and reinforces errors), this framework uses the verifier to filter pseudo-labels for the generator. This bootstrapping mechanism yields 4.7-5.9% gains on reasoning benchmarks without needing ground-truth labels.
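The contrast between the two labeling schemes can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the `verify` scoring interface and the 0.5 threshold are assumptions for demonstration purposes.

```python
from collections import Counter

def majority_vote_label(answers):
    """Baseline: the most common sampled answer becomes the pseudo-label.
    If diversity collapses, this locks in a confidently wrong consensus."""
    return Counter(answers).most_common(1)[0][0]

def verifier_filtered_label(answers, verify, threshold=0.5):
    """Sketch of verifier-filtered pseudo-labeling (hypothetical interface):
    the model, acting as verifier, scores each candidate; only candidates
    clearing the threshold survive, and the pseudo-label is the majority of
    that filtered set. Returns None when nothing passes, so an unverifiable
    consensus produces no training signal."""
    kept = [a for a in answers if verify(a) >= threshold]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]

# Toy case: the generator most often samples "4" (a systematic error),
# but the verifier only trusts "6".
samples = ["4", "4", "4", "6", "6"]
scores = {"4": 0.2, "6": 0.9}
print(majority_vote_label(samples))                          # "4"
print(verifier_filtered_label(samples, lambda a: scores[a])) # "6"
```

The key difference: majority voting rewards whatever the model already believes, while the verifier-filtered variant can veto the consensus entirely, which is what breaks the self-reinforcing loop.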
From the abstract
Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles.