AI & ML Breaks Assumption

Proves the Key-Value (KV) cache is entirely redundant and can be bit-identically recomputed from the residual stream.

March 23, 2026

Original Paper

The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang

arXiv · 2603.19664

The Takeaway

This challenges the fundamental assumption that the KV cache is essential state for Transformer inference. It demonstrates that keys and values are deterministic projections of the residual stream, enabling KV-Direct, an inference scheme that reduces per-token memory overhead roughly 25x (5 KB vs. 136 KB per token) with zero loss in precision.

From the abstract

The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four a
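The core claim above, that keys and values are deterministic projections of the residual stream, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the weight names, dimensions, and single-layer setup are illustrative assumptions, and real attention layers add normalization, positional encoding, and multiple heads. The point is only that a deterministic linear projection, reapplied to the same cached residual vector, reproduces K and V bit-identically rather than approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 64, 8  # illustrative sizes, not from the paper

# Hypothetical per-layer projection weights (fixed after training).
W_k = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_v = rng.standard_normal((d_model, d_model)).astype(np.float32)

# Residual-stream vectors, one per token: the only state a
# KV-Direct-style scheme would need to keep around.
resid = rng.standard_normal((n_tokens, d_model)).astype(np.float32)

# Conventional inference: materialize and cache K and V per token.
K_cached = resid @ W_k
V_cached = resid @ W_v

# Recomputation on demand: project the cached residual again.
# Same inputs, same deterministic float ops -> same bits.
K_recomputed = resid @ W_k
V_recomputed = resid @ W_v

assert np.array_equal(K_cached, K_recomputed)  # exact, not approximate
assert np.array_equal(V_cached, V_recomputed)
```

In this toy setup the residual cache holds one `d_model` vector per token, while a conventional KV cache would hold two vectors per token per layer, which is where a multi-layer model's large memory ratio (the paper's ~25x figure) comes from.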