SCRL introduces the first negative-supervision mechanism for Test-Time Reinforcement Learning, preventing LLMs from reinforcing 'consensus lies'.
March 23, 2026
Original Paper
What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
arXiv · 2603.19880
The Takeaway
Current test-time scaling methods rely on majority voting, which fails when models are collectively wrong. SCRL adds entropy-gated negative labeling to prune incorrect trajectories, making reasoning models significantly more robust during inference-time compute scaling.
From the abstract
Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority-voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals.
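The core idea of entropy-gated negative labeling can be sketched roughly as follows. This is an illustrative reconstruction from the summary above, not SCRL's actual implementation: the function name, the entropy threshold `tau`, and the exact reward values are all assumptions.

```python
import math
from collections import Counter

def entropy_gated_rewards(answers, tau=1.0):
    """Assign pseudo-rewards to sampled answers from one prompt.

    Sketch of entropy-gated labeling on top of majority-vote TTRL:
    when consensus is strong (low entropy), reward the majority answer;
    when consensus is weak (high entropy), negatively label minority
    trajectories to prune them rather than trusting the vote.
    The threshold `tau` and reward magnitudes are assumed, not from the paper.
    """
    counts = Counter(answers)
    n = len(answers)
    # Shannon entropy (nats) of the empirical answer distribution
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    majority, _ = counts.most_common(1)[0]
    if entropy <= tau:
        # Strong consensus: standard positive pseudo-labeling
        return [1.0 if a == majority else 0.0 for a in answers]
    # Weak consensus: negative supervision to prune dispersed trajectories
    return [0.0 if a == majority else -1.0 for a in answers]
```

Under this sketch, a concentrated answer distribution (e.g. 7 of 10 samples agreeing) passes the gate and gets positive rewards, while a fully dispersed one triggers negative labeling of the minority answers instead.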