Extends LLM context from 32K to 128K by teaching models to selectively skip global attention for ~80% of tokens.
March 19, 2026
Original Paper
Learning When to Attend: Conditional Memory Access for Long-Context LLMs
arXiv · 2603.17484
The Takeaway
The L2A layer learns which tokens actually require long-range memory access and which can rely on local context alone. This yields a 2x improvement in training throughput and a substantially smaller KV cache, making long-context reasoning far more accessible on commodity hardware.
From the abstract
Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of attention. We observe that most tokens do not require (global) attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke it.
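A minimal sketch of the idea described above, assuming a sigmoid gate over a learned per-token projection that decides whether a token attends to the full sequence or only a local window. The gate parameterization, threshold, and window size here are illustrative stand-ins, not the paper's actual design:

```python
import numpy as np

def l2a_attention(x, w_gate, threshold=0.5, window=4):
    """Token-wise conditional attention (illustrative sketch):
    gated tokens attend globally, the rest only within a local window.

    x:      (T, d) token representations
    w_gate: (d,)   hypothetical learned gating projection
    """
    T, d = x.shape
    logits = x @ x.T / np.sqrt(d)                # (T, T) attention logits
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))   # sigmoid gate per token
    is_global = gate > threshold                 # which tokens pay for global attention

    out = np.zeros_like(x)
    for t in range(T):
        if is_global[t]:
            keys = np.arange(T)                  # full-sequence attention
        else:
            lo, hi = max(0, t - window), min(T, t + window + 1)
            keys = np.arange(lo, hi)             # cheap local-window attention
        w = np.exp(logits[t, keys] - logits[t, keys].max())
        out[t] = (w / w.sum()) @ x[keys]         # softmax-weighted context vector
    return out, is_global
```

In this toy form the cost saving is only conceptual; the point is that tokens below the gate threshold never touch distant keys, which is what lets the real layer skip global attention (and the associated KV cache reads) for most of the sequence.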