AI & ML Efficiency Breakthrough

An online length-aware scheduling strategy that eliminates training 'bubbles' during the rollout phase of LLM reinforcement learning.

March 25, 2026

Original Paper

SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You

arXiv · 2603.23414

The Takeaway

Rollout is the primary bottleneck in RL training for long-chain-of-thought models. This method reorders rollout samples by response length so that policy updates can begin earlier, reducing compute idle time by over 50% and significantly accelerating the training of large reasoning models such as Qwen-2.5-32B.
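The core idea can be sketched in a few lines. The snippet below is a minimal illustration of length-aware scheduling, not the paper's actual implementation: samples are ordered by an (assumed) per-sample length estimate so that short rollouts finish together and updates need not wait for the longest trajectory. The function name and interface are hypothetical.

```python
# Minimal sketch of length-aware rollout scheduling (hypothetical
# interface, not the paper's implementation). Sorting by estimated
# response length groups short rollouts so policy updates can start
# before the longest trajectories finish.

def schedule_rollouts(samples, microbatch_size):
    """Yield microbatches of sample ids, shortest estimated length first.

    samples: list of (sample_id, est_len) pairs, where est_len is an
    assumed length estimate (e.g., from earlier generations).
    """
    ordered = sorted(samples, key=lambda s: s[1])  # shortest first
    for i in range(0, len(ordered), microbatch_size):
        yield [sid for sid, _ in ordered[i:i + microbatch_size]]

batches = list(schedule_rollouts(
    [("a", 1200), ("b", 16000), ("c", 800), ("d", 4000)], 2))
# The two short samples ("c", "a") form the first microbatch.
```

With the unsorted batch, every update would wait on sample "b" (16k tokens); with sorting, the first microbatch is ready as soon as the two short rollouts complete.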

From the abstract

Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL […]
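To see why synchronous rollout is so wasteful, a back-of-envelope model helps (my own toy calculation, not from the paper): if every worker must wait for the longest trajectory in the batch, the idle fraction is roughly 1 minus the ratio of mean to maximum length. The token counts below are hypothetical.

```python
# Toy model of the synchronization "bubble" (assumed numbers, not from
# the paper): with fully synchronous rollout, workers idle until the
# longest trajectory finishes, so idle fraction ~= 1 - mean / max.

lengths = [800, 1200, 4000, 16000]  # hypothetical trajectory lengths (tokens)
idle = 1 - (sum(lengths) / len(lengths)) / max(lengths)
print(f"idle fraction: {idle:.0%}")  # mean 5500 vs max 16000 -> ~66% idle
```

Even this crude model shows how a single 16k-token outlier can leave most of the compute idle, which is exactly the bubble that length-aware scheduling targets.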