The 'consensus trap' in label-free RL—where models reinforce their own systematic errors—can be broken by co-evolving the model in alternating generator and verifier roles.
March 19, 2026
Original Paper
CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution
arXiv · 2603.17775
The Takeaway
Instead of relying on majority voting (which often collapses diversity and reinforces errors), this framework uses the verifier to filter pseudo-labels for the generator. This bootstrapping mechanism yields 4.7-5.9% gains on reasoning benchmarks without needing ground-truth labels.
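The contrast between the two labeling schemes can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the `verify` scoring interface and the 0.5 threshold are assumptions for demonstration purposes.

```python
from collections import Counter

def majority_vote_label(answers):
    """Baseline: the most common sampled answer becomes the pseudo-label.
    If diversity collapses, this locks in a confidently wrong consensus."""
    return Counter(answers).most_common(1)[0][0]

def verifier_filtered_label(answers, verify, threshold=0.5):
    """Sketch of verifier-filtered pseudo-labeling (hypothetical interface):
    the model, acting as verifier, scores each candidate; only candidates
    clearing the threshold survive, and the pseudo-label is the majority of
    that filtered set. Returns None when nothing passes, so an unverifiable
    consensus produces no training signal."""
    kept = [a for a in answers if verify(a) >= threshold]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]

# Toy case: the generator most often samples "4" (a systematic error),
# but the verifier only trusts "6".
samples = ["4", "4", "4", "6", "6"]
scores = {"4": 0.2, "6": 0.9}
print(majority_vote_label(samples))                          # "4"
print(verifier_filtered_label(samples, lambda a: scores[a])) # "6"
```

The key difference: majority voting rewards whatever the model already believes, while the verifier-filtered variant can veto the consensus entirely, which is what breaks the self-reinforcing loop.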
From the abstract
Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles.