Introduces the first reinforcement learning framework to compress implicit reasoning steps in looped language models.
March 23, 2026
Original Paper
LoopRPT: Reinforcement Pre-Training for Looped Language Models
arXiv · 2603.19714
The Takeaway
Looped LMs offer a compact alternative to Chain-of-Thought, but training their latent steps is difficult because no explicit tokens are emitted to supervise. LoopRPT uses RL signals to shape intermediate representations, allowing models to Pareto-dominate baselines: higher accuracy with significantly fewer loop iterations.
From the abstract
Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a […]
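To make the structural mismatch concrete, here is a toy sketch of the looped setup the abstract describes: a single weight-tied block is applied repeatedly to refine a latent state, and the only scalar that can reach those intermediate iterations is a per-token reward on the final prediction. All names, dimensions, and the REINFORCE-style reward are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions, not from the paper)
d, vocab = 8, 5
W_loop = rng.normal(scale=0.3, size=(d, d))   # shared weights reused every loop
W_out = rng.normal(scale=0.3, size=(d, vocab))

def looped_forward(h, n_loops):
    """Refine the latent state by reapplying the same block n_loops times,
    then decode a next-token distribution from the final state."""
    for _ in range(n_loops):
        h = np.tanh(W_loop @ h + h)           # residual update with tied weights
    logits = W_out.T @ h
    p = np.exp(logits - logits.max())         # stable softmax
    return p / p.sum()

h0 = rng.normal(size=d)
target = 2                                    # the "correct" next token in this toy

# The latent loop emits no tokens of its own, so an RL signal must come from
# the output side, e.g. the log-likelihood of the observed next token.
for n in (1, 2, 4, 8):
    probs = looped_forward(h0.copy(), n)
    reward = float(np.log(probs[target]))
    print(f"loops={n}: reward={reward:.3f}")
```

The point of the sketch is the mismatch itself: standard RL credits output tokens, but here every intermediate iteration shares one terminal reward, which is what a reinforcement pre-training objective for LoopLMs has to address.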