MixedDimKV reaches 100% accuracy at a 50K context length while using as little as 0.26% of the full KV cache.
March 24, 2026
Original Paper
Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression
arXiv · 2603.20616
The Takeaway
By moving beyond all-or-nothing token eviction to finer-grained, mixed-dimension budget allocation, this method makes long-context inference feasible on hardware that would otherwise run out of memory, and it outperforms standard head-pruning techniques.
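To make the contrast concrete, here is a minimal Python sketch of the two budgeting styles. It is an illustration only, not the paper's algorithm: the function names, the importance scores, the allowed levels (full, half, quarter, zero), and the proportional rounding rule are all assumptions chosen for simplicity. It only decides how many dimensions each token keeps; it does not perform the actual compression.

```python
def token_eviction(scores, head_dim, budget):
    """All-or-nothing: the top-scoring tokens keep head_dim dimensions, the rest keep 0."""
    keep = budget // head_dim  # how many full-size tokens fit in the budget
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dims = [0] * len(scores)
    for i in ranked[:keep]:
        dims[i] = head_dim
    return dims


def mixed_dim_allocation(scores, head_dim, budget):
    """Hypothetical mixed-dimension rule (an assumption, not the paper's method).

    Each token's ideal share is proportional to its score, then rounded DOWN
    to the nearest allowed level, so the total never exceeds the budget.
    Scores are assumed non-negative with a positive sum.
    """
    allowed = [head_dim, head_dim // 2, head_dim // 4, 0]  # descending levels
    total = sum(scores)
    dims = []
    for s in scores:
        ideal = budget * s / total
        # Largest allowed level that does not exceed the ideal share.
        dims.append(next(level for level in allowed if level <= ideal))
    return dims


# Illustrative inputs: 5 tokens, 64-dim heads, a budget of 128 dimensions.
scores = [8, 6, 3, 2, 1]
evicted = token_eviction(scores, head_dim=64, budget=128)
mixed = mixed_dim_allocation(scores, head_dim=64, budget=128)

print(evicted)  # [64, 64, 0, 0, 0]
print(mixed)    # [32, 32, 16, 0, 0]
```

Under the same budget, eviction retains two tokens at full size, while the mixed rule retains three tokens at reduced sizes. Rounding down leaves part of the budget unused in this toy version; a real allocator would presumably spend it more carefully.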
From the abstract
Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and