Speeds up RL-based reasoning training by 1.7x using an online quality head to prune failing rollouts mid-generation.
March 27, 2026
Original Paper
Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR
arXiv · 2603.24840
The Takeaway
Methods like GRPO are slow because they require full rollouts for every prompt; 'arrol' prunes hopeless samples during generation. This significantly reduces compute waste and focuses the learning signal on samples that actually contribute to reasoning improvements.
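The core idea in the takeaway can be sketched as a generation loop that checks each partial rollout against a learned quality head and drops the hopeless ones early. This is a minimal illustration, not the paper's exact algorithm; `generate_chunk`, `quality_head`, and the threshold value are all hypothetical stand-ins.

```python
# Hypothetical sketch of online rollout pruning: after each chunk of tokens,
# a quality head estimates the partial rollout's chance of success, and
# rollouts scoring below a threshold are pruned so the remaining generation
# budget is not wasted on them.

def prune_as_you_generate(prompts, generate_chunk, quality_head,
                          max_chunks=8, threshold=0.2):
    """Return finished rollouts, pruning low-scoring ones mid-generation.

    generate_chunk(prompt, partial) -> next chunk of text (stand-in for the LLM)
    quality_head(prompt, partial)   -> estimated success probability in [0, 1]
    """
    active = [(p, "") for p in prompts]  # (prompt, partial completion)
    for _ in range(max_chunks):
        survivors = []
        for prompt, partial in active:
            partial = partial + generate_chunk(prompt, partial)
            score = quality_head(prompt, partial)
            if score >= threshold:
                survivors.append((prompt, partial))
            # else: pruned here -- all remaining chunks of compute are saved
        active = survivors
    return active
```

With a toy quality head that trusts prompts containing "good" and rejects the rest, only the promising rollout survives to full length:

```python
finished = prune_as_you_generate(
    ["good question", "bad question"],
    generate_chunk=lambda p, c: "x",
    quality_head=lambda p, c: 0.9 if "good" in p else 0.1,
    max_chunks=3,
)
# finished -> [("good question", "xxx")]
```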
From the abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce …
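The sparse-advantage point from the abstract is easy to see numerically. GRPO computes each rollout's advantage relative to its group (reward minus group mean, scaled by group standard deviation), so a saturated group contributes nothing; the small `eps` term here is an assumption for numerical safety, not necessarily the paper's formulation.

```python
# GRPO-style group-relative advantage: (r - mean) / (std + eps) within one
# group of rollouts for the same prompt. When every rollout in the group gets
# the same reward (all-correct or all-incorrect), the variance is zero and
# every advantage collapses to ~0 -- the group yields no learning signal.

def group_relative_advantages(rewards, eps=1e-8):
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])      # ~[+1, -1, +1, -1]
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # ~[0, 0, 0, 0]
```

A half-correct group produces strong, informative advantages, while an all-correct group is dead weight: exactly the samples that online pruning aims to stop paying for.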