AI & ML Scaling Insight

Provides a geometric 'manifold envelopment' framework to explain why unsupervised RL for mathematical reasoning often collapses and how to stabilize it.

March 18, 2026

Original Paper

When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective

Zelin Zhang, Fei Cheng, Chenhui Chu

arXiv · 2603.16578

The Takeaway

Practitioners trying to scale reasoning training without expensive ground-truth labels often run into policy collapse. This paper identifies the foundational logical priors a model needs for unsupervised RL to succeed, and offers a diagnostic tool for predicting training stability in advance.

From the abstract

Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite […]
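To make the abstract's "intrinsic rewards" concrete, here is a minimal sketch of one common label-free reward signal, majority-vote self-consistency, which rewards a model for agreeing with its own sampled answers. This is a generic illustration, not the reward design from the paper; the function name and inputs are hypothetical. It also shows why such rewards invite the collapse and reward hacking the abstract warns about: a degenerate policy that always emits the same answer earns the maximal reward regardless of correctness.

```python
from collections import Counter

def self_consistency_reward(sampled_answers):
    """Label-free intrinsic reward: the fraction of sampled answers
    that agree with the majority answer. No ground truth is needed.

    Returns (reward, majority_answer)."""
    counts = Counter(sampled_answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    return majority_count / len(sampled_answers), majority_answer

# A healthy, diverse policy earns only a moderate reward:
reward, answer = self_consistency_reward(["12", "12", "15", "12", "9"])
print(reward, answer)  # 0.6 "12"

# A collapsed policy that always outputs one answer maxes the reward,
# even if that answer is wrong -- the reward-hacking failure mode:
collapsed_reward, _ = self_consistency_reward(["7"] * 5)
print(collapsed_reward)  # 1.0
```

The failure is structural: agreement-based rewards measure confidence, not correctness, so stability depends on the model's priors already pointing toward valid reasoning, which is the regime the paper's framework aims to characterize.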