SARL improves reasoning models by rewarding the 'topology' of thoughts rather than just the final answer, enabling effective RL without ground-truth labels.
March 31, 2026
Original Paper
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
arXiv · 2603.27977
The Takeaway
SARL shifts supervision from the 'destination' (labels) to the 'path' (reasoning structure) by rewarding small-world network properties in the model's reasoning map. This allows reinforcement learning to be applied to open-ended domains where correctness is ambiguous or expensive to verify.
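To make "small-world properties" concrete, here is a minimal sketch of how such a structure could be scored. It assumes the reasoning map is available as a list of undirected edges between reasoning steps. The function names and the clustering-to-path-length ratio are illustrative assumptions; the summary does not give the paper's actual reward formula. Small-world graphs combine high local clustering with short average path lengths, so the ratio rises when both hold.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist between pairs of this node's neighbours.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean BFS distance over reachable ordered pairs; None if there are none."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else None


def small_world_score(edges):
    """Hypothetical structural score: clustering divided by path length.

    This is an illustrative stand-in, not the reward defined in the paper.
    """
    adj = build_graph(edges)
    length = average_path_length(adj)
    if length is None:
        return 0.0
    return average_clustering(adj) / length
```

Under this sketch, a tightly interlinked set of steps such as `[("a", "b"), ("b", "c"), ("a", "c")]` scores higher than a purely linear chain of the same steps, because the chain has no clustering at all.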
From the abstract
Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teachi