AI & ML Efficiency Breakthrough

Optimizes LLM inference scheduling by treating output length as a heavy-tailed distribution rather than a point estimate.

April 2, 2026

Original Paper

Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang

arXiv · 2604.00499

The Takeaway

By using the Tail Inflated Expectation (TIE) metric, this approach reduces per-token latency by 2.3x and improves throughput by 1.4x. It addresses the head-of-line (HOL) blocking problem in LLM serving by accounting for the stochastic nature of token generation rather than relying on a single predicted output length.
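The core idea can be sketched in a few lines. The exact TIE formula is defined in the paper; the version below is a hypothetical illustration that inflates the mean predicted length by a fraction of the gap to a high quantile, then schedules shortest-score-first. The function names, the quantile `q`, and the weight `lam` are assumptions, not the paper's API.

```python
import statistics

def tail_inflated_expectation(samples, q=0.9, lam=0.5):
    """Score = mean + lam * (q-th quantile - mean).

    A hypothetical tail-inflated expectation over sampled output
    lengths; requests whose length distribution has a heavy right
    tail get a higher score than their mean alone would suggest.
    """
    samples = sorted(samples)
    mean = statistics.fmean(samples)
    # simple nearest-rank estimate of the q-th quantile
    idx = min(len(samples) - 1, int(q * len(samples)))
    tail = samples[idx]
    return mean + lam * max(0.0, tail - mean)

def schedule(requests):
    """requests: list of (request_id, length_samples).

    Returns request ids in shortest-expected-first order,
    i.e. SJF under an uncertainty-aware length score.
    """
    ranked = sorted(requests, key=lambda r: tail_inflated_expectation(r[1]))
    return [rid for rid, _ in ranked]
```

For example, a request whose sampled lengths are `[5, 6, 150]` has a lower mean than one at `[50, 55, 60]`, so point-estimate SJF would run it first; the tail-inflated score reverses that order because its heavy tail risks blocking the queue.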

From the abstract

To schedule LLM inference, the shortest job first (SJF) principle is favorable by prioritizing requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature and determined by when the end-of-sequence (EOS) token …
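Why HOL blocking makes SJF attractive can be shown with a textbook scheduling example, independent of the paper's method. If jobs run back-to-back, one long request at the head of a first-come-first-served queue inflates every later request's wait:

```python
def avg_wait(order_lengths):
    """Average waiting time when jobs run back-to-back in the given order."""
    wait = 0.0
    elapsed = 0.0
    for length in order_lengths:
        wait += elapsed          # this job waits for everything before it
        elapsed += length
    return wait / len(order_lengths)

jobs = [100, 3, 4, 5]            # one long request arrives first
fcfs_wait = avg_wait(jobs)       # long job blocks the short ones: 77.5
sjf_wait = avg_wait(sorted(jobs))  # shortest job first: 5.5
```

The same short requests wait over an order of magnitude longer under FCFS, which is exactly the blocking that length-aware scheduling tries to avoid; the paper's point is that this only works if the length estimates respect the uncertainty of decoding.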