AI & ML Efficiency Breakthrough

EntropyCache achieves up to 26x speedup for Diffusion Language Models by using decoded token entropy as a proxy for KV cache staleness.

March 20, 2026

Original Paper

EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo

arXiv · 2603.18489

The Takeaway

Diffusion-based LLMs typically require a full forward pass at every denoising step; this method enables training-free, sparse KV caching whose update decisions cost only 0.5% of inference time. It makes a new class of non-autoregressive models computationally competitive with standard Transformers.

From the abstract

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding […]
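To make the core signal concrete, here is a minimal sketch of an entropy-based cache-refresh decision. The function names, the threshold value, and the direction of the comparison (refresh on *high* entropy, on the assumption that uncertain decodes indicate stale cached states) are illustrative assumptions, not the paper's actual implementation; the only idea taken from the abstract is that the maximum entropy over newly decoded token distributions is a constant-cost decision signal.

```python
import numpy as np

def max_token_entropy(probs: np.ndarray) -> float:
    """Maximum Shannon entropy over newly decoded token distributions.

    probs: (num_new_tokens, vocab_size) array of softmax probabilities.
    Cost is O(num_new_tokens * vocab_size), independent of context
    length and model depth -- the "constant-cost" property.
    """
    eps = 1e-12  # avoid log(0)
    entropies = -(probs * np.log(probs + eps)).sum(axis=-1)
    return float(entropies.max())

def should_refresh_cache(probs: np.ndarray, threshold: float = 1.0) -> bool:
    # Assumed policy: high entropy in freshly decoded tokens is treated
    # as evidence that the cached KV states have gone stale.
    return max_token_entropy(probs) > threshold

# Illustrative inputs over a toy 4-token vocabulary:
confident = np.zeros((1, 4)); confident[0, 0] = 1.0   # one-hot, entropy 0
uncertain = np.full((1, 4), 0.25)                     # uniform, entropy ln 4
```

With a threshold of 1.0, the one-hot decode keeps the cache (entropy 0) while the uniform decode triggers a refresh (entropy ln 4 ≈ 1.386). The real method's refresh policy and threshold selection are described in the paper.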