Replacing the learned attention weight matrices with a simple Gaussian formula cuts Transformer parameter counts by over $50\%$ without hurting performance.
Standard Transformers use large learned projection matrices to decide which tokens attend to which. This paper shows that a basic Gaussian kernel can do the same job with far fewer parameters. By removing these learned projections, the researchers substantially reduce model size while keeping accuracy competitive. This suggests that much of the useful behavior in modern models may come from simple similarity geometry rather than from the specific weights we spend millions to train, and it points toward smaller, faster models that can run on consumer devices.
Projection-Free Transformers via Gaussian Kernel Attention
arXiv · 2605.02144
Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a simpler similarity-based diffusion operator. We introduce \textbf{Gaussian Kernel Attention} (GKA), a drop-in replacement for dot-product attention that computes token affinities directly using a Gaussian radial basis function over the input $X$.
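To make the contrast concrete, here is a minimal NumPy sketch of single-head dot-product attention next to a Gaussian-kernel variant built only from the raw token embeddings. The bandwidth `sigma`, the row normalization, and the choice to apply the affinities directly to `X` (i.e., also dropping the value projection) are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def standard_attention(X, W_Q, W_K, W_V):
    """Single-head dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def gaussian_kernel_attention(X, sigma=1.0):
    """Projection-free sketch: token affinities from a Gaussian RBF on the
    raw embeddings, row-normalized into a diffusion operator, applied to X.
    The bandwidth sigma, the row normalization, and dropping the value
    projection are assumptions for illustration."""
    sq_norms = (X ** 2).sum(axis=-1)
    # Pairwise squared Euclidean distances (clipped at 0 for numerical safety).
    dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    A = np.exp(-dists / (2.0 * sigma ** 2))   # Gaussian RBF affinities
    A /= A.sum(axis=-1, keepdims=True)        # each row sums to 1
    return A @ X                              # no W_Q, W_K, or W_V needed

# Tiny usage example: 4 tokens, embedding dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
print(standard_attention(X, W_Q, W_K, W_V).shape)      # (4, 8)
print(gaussian_kernel_attention(X, sigma=1.0).shape)   # (4, 8)
```

Under these assumptions, the kernel variant needs no trainable attention parameters at all; the affinity matrix depends only on pairwise distances between token embeddings and a single bandwidth hyperparameter.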