Pruning low-utility prompts before RL rollouts allows for 10x more efficient training of large reasoning models.
March 27, 2026
Original Paper
Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model
arXiv · 2603.25184
The Takeaway
Reinforcement learning for reasoning (e.g., GRPO) is bottlenecked by expensive rollouts. By identifying the 'learning edge'—prompts with high uncertainty and intermediate difficulty—HIVE significantly reduces compute costs without sacrificing model performance.
From the abstract
Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase…
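To make the intuition concrete, here is a minimal sketch of difficulty-based prompt selection. It is not the paper's HIVE algorithm; it assumes a hypothetical per-prompt estimate `p_hat` of the model's success probability and scores prompts by the Bernoulli variance `p(1-p)`, which is zero exactly when all rollouts would agree (all-correct or all-wrong groups yield zero advantage in GRPO) and peaks at intermediate difficulty.

```python
def prompt_utility(p_hat: float) -> float:
    """Proxy for expected gradient signal under GRPO-style group advantages.

    Bernoulli variance p(1-p): 0 when the prompt is trivially easy (p=1)
    or hopelessly hard (p=0), maximal at p=0.5 (the 'learning edge').
    """
    return p_hat * (1.0 - p_hat)


def select_prompts(prompts: list[str], p_hats: list[float], k: int) -> list[str]:
    """Keep the k prompts with the highest utility; skip rollouts for the rest."""
    scored = sorted(zip(prompts, p_hats),
                    key=lambda pair: prompt_utility(pair[1]),
                    reverse=True)
    return [prompt for prompt, _ in scored[:k]]


# Example: only the intermediate-difficulty prompt survives pruning.
batch = ["easy proof", "borderline puzzle", "unsolved problem"]
estimates = [0.95, 0.50, 0.05]  # hypothetical success-rate estimates
print(select_prompts(batch, estimates, k=1))  # → ['borderline puzzle']
```

In practice, the success-rate estimate would itself come from an online verifier or a cheap predictor, which is where the actual method's machinery lives; this sketch only illustrates why intermediate-difficulty prompts carry the most learning signal.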