Memory Sparse Attention (MSA) enables LLMs to scale to 100 million tokens with linear complexity and less than 9% precision degradation.
March 26, 2026
Original Paper
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
arXiv · 2603.23516
The Takeaway
MSA shatters the 1M-token barrier for local hardware, allowing 100M-token inference on just two A800 GPUs. This makes 'lifetime-scale' context practical for long-history agents and massive document corpus analysis without the retrieval gaps of RAG.
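The excerpt does not spell out MSA's selection mechanism, but the general idea behind sparse attention's sub-quadratic cost can be illustrated with a minimal sketch: each query attends only to the few key blocks judged most relevant (here scored by each block's mean key, a stand-in heuristic, not the paper's rule), so per-query work scales with the number of selected blocks rather than the full context length.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, K, V, block=4, top_k=2):
    """Single-query block-sparse attention: attend only to the top_k
    key blocks whose mean key scores highest against q. This is a
    generic illustrative pattern, not MSA's actual algorithm."""
    n, d = K.shape
    n_blocks = n // block
    # Score each block cheaply by its mean key vector.
    block_means = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    scores = block_means @ q
    chosen = np.argsort(scores)[-top_k:]
    # Gather only the selected blocks' keys/values.
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in chosen])
    w = softmax(K[idx] @ q / np.sqrt(d))
    return w @ V[idx]  # attends to top_k * block keys, not all n

rng = np.random.default_rng(0)
n, d = 16, 8
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
q = rng.normal(size=d)
out = block_sparse_attention(q, K, V)
print(out.shape)  # (8,)
```

With a fixed `top_k`, each query touches `top_k * block` keys regardless of context length, which is how sparse-attention schemes avoid the quadratic cost of full attention; the block names and parameters above are assumptions for illustration only.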
From the abstract
Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often…