AI & ML Efficiency Breakthrough

Accelerates sparse attention by 75% through cross-layer reuse of lightning-indexer decisions, tackling a hidden bottleneck in production-grade LLMs.

March 13, 2026

Original Paper

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li

arXiv · 2603.12201

The Takeaway

In models like DeepSeek, the sparse attention indexer itself often retains quadratic complexity; IndexCache exploits cross-layer redundancy to remove this overhead. This is a practical, training-aware optimization that directly reduces serving costs for long-context agentic workflows.
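The core idea can be sketched in a few lines: a lightweight indexer scores all key positions per query (the quadratic step), and a cache lets later layers reuse those top-k decisions instead of recomputing them. The sketch below is a minimal NumPy illustration under assumed interfaces, not the paper's implementation; `lightning_indexer`, `sparse_attention`, and `IndexCache` are hypothetical names, and real DSA uses learned indexer weights rather than raw query-key dot products.

```python
import numpy as np

def lightning_indexer(queries, keys, k):
    """Score every (query, key) pair and keep the top-k key indices per query.
    The full score matrix is the O(L^2) step that IndexCache aims to amortize."""
    scores = queries @ keys.T                      # (L, L) relevance scores
    return np.argsort(-scores, axis=-1)[:, :k]     # (L, k) selected indices

def sparse_attention(q, k_mat, v, idx):
    """Core attention over only the selected tokens: O(L*k) instead of O(L^2)."""
    out = np.empty((q.shape[0], v.shape[1]))
    for i in range(q.shape[0]):
        ks, vs = k_mat[idx[i]], v[idx[i]]          # gather the k chosen tokens
        s = q[i] @ ks.T
        w = np.exp(s - s.max())                    # stable softmax over k scores
        out[i] = (w / w.sum()) @ vs
    return out

class IndexCache:
    """Compute top-k indices once, then hand the same decision to later layers."""
    def __init__(self):
        self.idx = None
    def get_or_compute(self, queries, keys, k):
        if self.idx is None:                       # quadratic indexer runs once
            self.idx = lightning_indexer(queries, keys, k)
        return self.idx                            # later layers reuse it

# Two "layers" sharing one cached index set:
rng = np.random.default_rng(0)
L, d, k = 8, 4, 3
q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))
cache = IndexCache()
idx1 = cache.get_or_compute(q, K, k)   # layer 1: indexer computes indices
idx2 = cache.get_or_compute(q, K, k)   # layer 2: cache hit, no recompute
```

In this toy version the cache is shared unconditionally; the paper's contribution is deciding when cross-layer redundancy makes such reuse safe in a production model.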

From the abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity