AI & ML Efficiency Breakthrough

Fuses categorical sampling into the LM-head matmul to eliminate logit materialization and speed up LLM decoding by up to 19%.

March 18, 2026

Original Paper

FlashSampling: Fast and Memory-Efficient Exact Sampling

Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang

arXiv · 2603.15854

The Takeaway

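FlashSampling turns a bandwidth-bound post-processing step into a fused kernel epilogue without any approximation: samples are still drawn exactly from the categorical distribution. Because the logits tensor is never written to HBM, the technique is most relevant to high-throughput inference engines like vLLM.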

From the abstract

Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over til
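The arithmetic behind the steps quoted above can be illustrated with a minimal NumPy sketch. It is not the paper's kernel: the function name `tiled_gumbel_max_sample`, its parameters, and the tile size are hypothetical, and the loop runs on the CPU rather than as a fused GPU epilogue. It relies only on the standard Gumbel-max trick, in which taking the argmax of logits plus independent Gumbel(0, 1) noise yields a sample from softmax(logits). Each vocabulary tile keeps a single candidate per row, so the full logits matrix is never assembled.

```python
import numpy as np


def tiled_gumbel_max_sample(hidden, weight, tile_size, rng):
    """Sample one token id per row from softmax(hidden @ weight.T).

    Illustrative sketch only (hypothetical helper, not the paper's kernel).
    hidden: (batch, d_model) activations; weight: (vocab, d_model) LM-head matrix.
    """
    batch, vocab = hidden.shape[0], weight.shape[0]
    starts = range(0, vocab, tile_size)
    rows = np.arange(batch)

    # One (score, token) candidate per row and per vocabulary tile.
    tile_best_score = np.empty((batch, len(starts)))
    tile_best_token = np.empty((batch, len(starts)), dtype=np.int64)

    for t, start in enumerate(starts):
        stop = min(start + tile_size, vocab)
        # Logits for this tile only: shape (batch, stop - start).
        logits_tile = hidden @ weight[start:stop].T
        # Perturb with independent Gumbel(0, 1) noise.
        perturbed = logits_tile + rng.gumbel(size=logits_tile.shape)
        # Keep only the maximizer of this tile for each row.
        local = perturbed.argmax(axis=1)
        tile_best_score[:, t] = perturbed[rows, local]
        tile_best_token[:, t] = start + local

    # Small final reduction: pick the best candidate across tiles.
    winner = tile_best_score.argmax(axis=1)
    return tile_best_token[rows, winner]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(4, 16))    # 4 sequences, model width 16
    weight = rng.normal(size=(1000, 16)) # vocabulary of 1000 tokens
    print(tiled_gumbel_max_sample(hidden, weight, tile_size=128, rng=rng))
```

The only state that survives each tile is one score and one token id per row, which is the property that lets a real kernel avoid writing logits to memory. Sampling at a temperature other than 1 would require scaling the logits before adding the noise, which this sketch leaves out.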