AI & ML Efficiency Breakthrough

Speeds up RL-based reasoning training by 1.7x using an online quality head to prune failing rollouts mid-generation.

March 27, 2026

Original Paper

Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, Hanghang Tong

arXiv · 2603.24840

The Takeaway

Methods like GRPO are slow because they require full rollouts for every prompt. 'arrol' instead prunes hopeless samples mid-generation, which cuts compute waste and focuses the learning signal on samples that actually contribute to reasoning improvements.
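The mid-generation pruning idea can be sketched in a few lines. This is a toy illustration, not the paper's method: `quality_head` here is a hypothetical stand-in for the learned online quality head, and the tokens are just booleans.

```python
import random

def quality_head(partial_tokens):
    # Hypothetical stand-in for the paper's learned quality head:
    # here, simply the fraction of "promising" tokens so far.
    return sum(partial_tokens) / max(len(partial_tokens), 1)

def generate_with_pruning(n_rollouts=8, max_len=20, threshold=0.3, seed=0):
    rng = random.Random(seed)
    # Each rollout is a list of toy tokens (True = promising).
    rollouts = [[] for _ in range(n_rollouts)]
    alive = set(range(n_rollouts))
    for step in range(max_len):
        for i in list(alive):
            rollouts[i].append(rng.random() < 0.5)
            # Prune a rollout mid-generation once its score drops
            # below the threshold (after a short warm-up).
            if step >= 4 and quality_head(rollouts[i]) < threshold:
                alive.discard(i)
    return rollouts, alive

rollouts, alive = generate_with_pruning()
```

Pruned rollouts stop consuming generation steps, which is where the compute savings come from; only surviving rollouts run to full length.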

From the abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce …
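The sparse-advantage problem the abstract describes is easy to see in a toy GRPO-style advantage computation (a sketch under common GRPO conventions; `group_advantages` is an illustrative helper, not code from the paper):

```python
def group_advantages(rewards):
    """GRPO-style advantage: reward minus group mean, normalized by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        # All-correct or all-incorrect group: zero variance, no signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# A mixed group carries a learning signal...
print(group_advantages([1, 1, 0, 0]))  # [1.0, 1.0, -1.0, -1.0]
# ...but a uniform group contributes nothing to the gradient.
print(group_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]
```

Groups whose rollouts all succeed or all fail produce zero advantages, so the compute spent generating them is wasted; pruning such rollouts early avoids that cost.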