Pretrained Transformers exhibit a pervasive inter-head linear structure where many attention heads can be reconstructed from a small set of peer heads.
March 17, 2026
Original Paper
Linear Predictability of Attention Heads in Large Language Models
arXiv · 2603.13314
The Takeaway
This discovery allows for a 2x reduction in KV cache memory by storing only 'reference' heads and reconstructing the others on the fly. It suggests that modern LLMs are significantly over-parameterized in their attention mechanisms, offering a new path for inference optimization.
From the abstract
Large language model (LLM) inference is increasingly bottlenecked by the Key-Value (KV) cache, yet the fine-grained structure of attention-head activations remains poorly understood. We show that pretrained Transformers exhibit a pervasive inter-head linear structure: for a given token, the Query, Key, and Value (QKV) vectors of an attention head can often be reconstructed as a linear combination of a small number of peer heads, typically within the same layer. Across Llama-3.1-8B, Falcon3-10B, …
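The core idea from the abstract can be illustrated with a small sketch. This is not the paper's code: it builds synthetic per-token head activations in which one "target" head is (by construction) a linear mix of a few peer heads, then recovers the mixing weights by least squares on a calibration set of tokens, the way one might fit reconstruction coefficients for a reference-head scheme. All shapes and names here are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: fit linear coefficients that reconstruct one attention
# head's per-token vectors from a few "peer" heads in the same layer.
rng = np.random.default_rng(0)
n_tokens, head_dim, n_peers = 256, 64, 4

# Synthetic peer-head activations: (n_peers, n_tokens, head_dim).
peers = rng.standard_normal((n_peers, n_tokens, head_dim))

# Target head constructed as a linear mix of peers plus small noise,
# mimicking the inter-head linear structure the paper reports.
true_w = np.array([0.7, -0.3, 0.5, 0.1])
target = np.einsum("p,ptd->td", true_w, peers)
target += 0.01 * rng.standard_normal(target.shape)

# Least-squares fit of mixing weights, treating every (token, dim)
# entry as one sample with n_peers features.
X = peers.reshape(n_peers, -1).T          # (n_tokens*head_dim, n_peers)
y = target.reshape(-1)                    # (n_tokens*head_dim,)
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Reconstruct the target head from the fitted weights and measure error.
recon = np.einsum("p,ptd->td", w, peers)
rel_err = np.linalg.norm(recon - target) / np.linalg.norm(target)
print(f"fitted weights: {np.round(w, 2)}, relative error: {rel_err:.4f}")
```

In a real KV-cache setting, the analogous step would fit such coefficients once on calibration data, store only the peer ("reference") heads' K/V at inference time, and recompute the remaining heads' vectors from the stored ones.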