Pretrained Transformers exhibit a pervasive inter-head linear structure where many attention heads can be reconstructed from a small set of peer heads.
March 17, 2026
Original Paper
Linear Predictability of Attention Heads in Large Language Models
arXiv · 2603.13314
The Takeaway
This discovery allows for a 2x reduction in KV cache memory by storing only 'reference' heads and reconstructing the others on the fly. It suggests that modern LLMs are significantly over-parameterized in their attention mechanisms, offering a new path for inference optimization.
From the abstract
Large language model (LLM) inference is increasingly bottlenecked by the Key-Value (KV) cache, yet the fine-grained structure of attention-head activations remains poorly understood. We show that pretrained Transformers exhibit a pervasive inter-head linear structure: for a given token, the Query, Key, and Value (QKV) vectors of an attention head can often be reconstructed as a linear combination of a small number of peer heads, typically within the same layer. Across Llama-3.1-8B, Falcon3-10B, …
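The core idea from the abstract can be illustrated with a small sketch. This is not the paper's code: it builds synthetic per-token head activations in which one "target" head is (by construction) a linear mix of a few peer heads, then recovers the mixing weights by least squares on a calibration set of tokens, the way one might fit reconstruction coefficients for a reference-head scheme. All shapes and names here are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: fit linear coefficients that reconstruct one attention
# head's per-token vectors from a few "peer" heads in the same layer.
rng = np.random.default_rng(0)
n_tokens, head_dim, n_peers = 256, 64, 4

# Synthetic peer-head activations: (n_peers, n_tokens, head_dim).
peers = rng.standard_normal((n_peers, n_tokens, head_dim))

# Target head constructed as a linear mix of peers plus small noise,
# mimicking the inter-head linear structure the paper reports.
true_w = np.array([0.7, -0.3, 0.5, 0.1])
target = np.einsum("p,ptd->td", true_w, peers)
target += 0.01 * rng.standard_normal(target.shape)

# Least-squares fit of mixing weights, treating every (token, dim)
# entry as one sample with n_peers features.
X = peers.reshape(n_peers, -1).T          # (n_tokens*head_dim, n_peers)
y = target.reshape(-1)                    # (n_tokens*head_dim,)
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Reconstruct the target head from the fitted weights and measure error.
recon = np.einsum("p,ptd->td", w, peers)
rel_err = np.linalg.norm(recon - target) / np.linalg.norm(target)
print(f"fitted weights: {np.round(w, 2)}, relative error: {rel_err:.4f}")
```

In a real KV-cache setting, the analogous step would fit such coefficients once on calibration data, store only the peer ("reference") heads' K/V at inference time, and recompute the remaining heads' vectors from the stored ones.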