Achieves up to 14.4x higher decoding throughput in long-context LLMs via a training-free framework that reuses sparse memory at semantic boundaries.
March 13, 2026
Original Paper
Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
arXiv · 2603.12038
The Takeaway
By exploiting the stability of attention patterns within semantically coherent spans, Slow-Fast Inference (SFI) offers a practical way to drastically reduce the cost of long-context and agentic workloads without requiring model retraining or fine-tuning.
From the abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense slow steps.
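To make the slow/fast decoupling concrete, here is a minimal toy sketch of the idea in NumPy. It is not the paper's implementation: the function names, the top-k support selection, and the use of explicit sentence-boundary indices are all illustrative assumptions; the only point is that dense "slow" steps refresh a sparse attention support that cheap "fast" steps then reuse.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(q, K, V, idx=None):
    """One attention step over the full KV cache (idx=None)
    or over a sparse support given by index array idx."""
    if idx is not None:
        K, V = K[idx], V[idx]
    w = softmax(K @ q / np.sqrt(q.size))
    return w @ V, w

def slow_fast_decode(queries, K, V, k=4, boundaries=()):
    """Toy slow-fast loop (illustrative, not the paper's method):
    at sentence boundaries run a dense 'slow' step and refresh a
    top-k attention support; between boundaries, 'fast' steps
    attend only over that cached support."""
    support = None
    outs = []
    for t, q in enumerate(queries):
        if support is None or t in boundaries:
            # slow step: dense attention, then cache the top-k support
            out, w = decode_step(q, K, V)
            support = np.argsort(w)[-k:]
        else:
            # fast step: sparse attention over the cached support only
            out, _ = decode_step(q, K, V, support)
        outs.append(out)
    return np.stack(outs)
```

With `k` equal to the cache size, the fast steps cover the whole cache and the loop reduces to ordinary dense decoding; the savings come from choosing `k` much smaller than the history length while the support stays stable within a span.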