Extends LLM context from 32K to 128K by teaching models to selectively skip global attention for ~80% of tokens.
March 19, 2026
Original Paper
Learning When to Attend: Conditional Memory Access for Long-Context LLMs
arXiv · 2603.17484
The Takeaway
The L2A layer learns which tokens actually require long-range memory access and which can rely on local context alone. This yields a 2x improvement in training throughput and a substantially smaller KV cache, making long-context reasoning far more accessible on commodity hardware.
From the abstract
Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of attention. We observe that most tokens do not require (global) attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke it.
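A minimal sketch of the idea described above, assuming a sigmoid gate over a learned per-token projection that decides whether a token attends to the full sequence or only a local window. The gate parameterization, threshold, and window size here are illustrative stand-ins, not the paper's actual design:

```python
import numpy as np

def l2a_attention(x, w_gate, threshold=0.5, window=4):
    """Token-wise conditional attention (illustrative sketch):
    gated tokens attend globally, the rest only within a local window.

    x:      (T, d) token representations
    w_gate: (d,)   hypothetical learned gating projection
    """
    T, d = x.shape
    logits = x @ x.T / np.sqrt(d)                # (T, T) attention logits
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))   # sigmoid gate per token
    is_global = gate > threshold                 # which tokens pay for global attention

    out = np.zeros_like(x)
    for t in range(T):
        if is_global[t]:
            keys = np.arange(T)                  # full-sequence attention
        else:
            lo, hi = max(0, t - window), min(T, t + window + 1)
            keys = np.arange(lo, hi)             # cheap local-window attention
        w = np.exp(logits[t, keys] - logits[t, keys].max())
        out[t] = (w / w.sum()) @ x[keys]         # softmax-weighted context vector
    return out, is_global
```

In this toy form the cost saving is only conceptual; the point is that tokens below the gate threshold never touch distant keys, which is what lets the real layer skip global attention (and the associated KV cache reads) for most of the sequence.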