Memory Sparse Attention (MSA) enables LLMs to scale to 100 million tokens with linear complexity and less than 9% precision degradation.
March 26, 2026
Original Paper
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
arXiv · 2603.23516
The Takeaway
MSA shatters the 1M-token barrier for local hardware, allowing 100M-token inference on just two A800 GPUs. This makes 'lifetime-scale' context practical for long-history agents and massive document corpus analysis without the retrieval gaps of RAG.
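The excerpt does not spell out MSA's selection mechanism, but the general idea behind sparse attention's sub-quadratic cost can be illustrated with a minimal sketch: each query attends only to the few key blocks judged most relevant (here scored by each block's mean key, a stand-in heuristic, not the paper's rule), so per-query work scales with the number of selected blocks rather than the full context length.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, K, V, block=4, top_k=2):
    """Single-query block-sparse attention: attend only to the top_k
    key blocks whose mean key scores highest against q. This is a
    generic illustrative pattern, not MSA's actual algorithm."""
    n, d = K.shape
    n_blocks = n // block
    # Score each block cheaply by its mean key vector.
    block_means = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    scores = block_means @ q
    chosen = np.argsort(scores)[-top_k:]
    # Gather only the selected blocks' keys/values.
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in chosen])
    w = softmax(K[idx] @ q / np.sqrt(d))
    return w @ V[idx]  # attends to top_k * block keys, not all n

rng = np.random.default_rng(0)
n, d = 16, 8
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
q = rng.normal(size=d)
out = block_sparse_attention(q, K, V)
print(out.shape)  # (8,)
```

With a fixed `top_k`, each query touches `top_k * block` keys regardless of context length, which is how sparse-attention schemes avoid the quadratic cost of full attention; the block names and parameters above are assumptions for illustration only.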
From the abstract
Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often…