Proposes a unified tensor-factorization view of attention that encompasses MHA, GQA, and MLA while reducing parameter counts by an order of magnitude.
April 1, 2026
Original Paper
Tucker Attention: A generalization of approximate attention mechanisms
arXiv · 2603.30033
The Takeaway
As LLMs move toward long-context windows, optimizing the KV cache and attention parameters is critical. Tucker Attention provides a generalized way to achieve extreme parameter efficiency without the trial-and-error of specific architectural tweaks like Group-Query Attention.
From the abstract
The pursuit of reducing the memory footprint of the self-attention mechanism in multi-head self-attention (MHA) has spawned a rich portfolio of methods, e.g., group-query attention (GQA) and multi-head latent attention (MLA). These methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise questions of which objects they really approximate and how to interpret them.
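To make the tensor-factorization view concrete, here is a minimal NumPy sketch of a Tucker decomposition applied to a per-head attention projection tensor. All dimensions, ranks, and variable names are illustrative assumptions, not values or code from the paper: the point is only that a small core tensor plus three factor matrices can reconstruct the full head/embedding/head-dim projection with far fewer parameters.

```python
import numpy as np

# Hypothetical sizes (not from the paper): 8 heads, model dim 64, head dim 16.
H, D, K = 8, 64, 16
# Tucker ranks along (heads, model dim, head dim) -- the knobs that trade
# parameters for fidelity; a low head rank mimics GQA/MLA-style sharing.
r_h, r_d, r_k = 2, 16, 8

rng = np.random.default_rng(0)
# One factor matrix per mode of the 3-way projection tensor.
U_h = rng.standard_normal((H, r_h))
U_d = rng.standard_normal((D, r_d))
U_k = rng.standard_normal((K, r_k))
# Small core tensor coupling the three modes.
G = rng.standard_normal((r_h, r_d, r_k))

# Reconstruct the full per-head projection tensor:
# W[h, d, k] = sum_{a,b,c} G[a, b, c] * U_h[h, a] * U_d[d, b] * U_k[k, c]
W = np.einsum('abc,ha,db,kc->hdk', G, U_h, U_d, U_k)

full_params = H * D * K
tucker_params = G.size + U_h.size + U_d.size + U_k.size
print(W.shape, full_params, tucker_params)  # (8, 64, 16) 8192 1424
```

With these toy ranks the factored form stores 1,424 numbers instead of 8,192, roughly a 5.8x reduction; the paper's claimed order-of-magnitude savings would come from choosing ranks aggressively at scale.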