An online length-aware scheduling strategy that eliminates training 'bubbles' during the rollout phase of LLM reinforcement learning.
March 25, 2026
Original Paper
SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling
arXiv · 2603.23414
The Takeaway
Rollout is the primary bottleneck in RL training for long-chain-of-thought models. SortedRL reorders rollout samples by generation length so policy updates can begin earlier, cutting compute idle time by over 50% and significantly accelerating the training of large reasoning models such as Qwen-2.5-32B.
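The core idea can be illustrated with a toy sketch (not the paper's implementation; field names like `estimated_len` and the bucketing strategy are assumptions): if rollouts are batched in arrival order, every short trajectory in a batch idles until the longest one finishes, whereas sorting by expected length groups stragglers together and frees short batches to trigger policy updates early.

```python
# Toy sketch of length-aware batch scheduling (illustrative only; not
# the paper's actual scheduler). `estimated_len` is an assumed field.

def sorted_batches(samples, batch_size):
    """Group rollout samples by estimated generation length so each
    batch finishes at roughly the same time."""
    ordered = sorted(samples, key=lambda s: s["estimated_len"])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

def idle_tokens(batches):
    """Tokens of idle generation capacity: each sample in a synchronous
    batch waits until the longest sample in that batch completes."""
    return sum(max(s["estimated_len"] for s in b) - s["estimated_len"]
               for b in batches for s in b)

samples = [
    {"id": 0, "estimated_len": 16000},  # long chain-of-thought
    {"id": 1, "estimated_len": 512},
    {"id": 2, "estimated_len": 900},
    {"id": 3, "estimated_len": 15000},
]

# Arrival-order batching: short samples get stuck behind 16k stragglers.
naive = [samples[0:2], samples[2:4]]
# Length-sorted batching: short samples finish (and update) together.
smart = sorted_batches(samples, batch_size=2)

print(idle_tokens(naive))  # 29588 idle tokens
print(idle_tokens(smart))  # 1388 idle tokens
```

In this contrived example, sorting removes most of the idle capacity; the actual paper's online scheduler additionally has to handle the fact that true generation lengths are unknown until rollout completes.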
From the abstract
Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL …