HISA eliminates the quadratic O(L²) bottleneck in sparse attention indexers, enabling efficient long-context scaling for models like DeepSeek-V3.
March 31, 2026
Original Paper
HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
arXiv · 2603.28458
The Takeaway
HISA introduces a hierarchical search that replaces the indexer's flat token scan with a block-level coarse filter before fine-grained token selection, achieving up to 4x speedups at 128K context. It is a drop-in replacement that preserves the exact top-k sparsity pattern, so no fine-tuning is required.
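The coarse-to-fine idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the coarse stage scores each block by its maximum token score, which has the convenient property that keeping the top-k blocks provably covers every exact top-k token (any block whose max reaches the k-th largest score must itself contain a top-k token). The function names and the block size are illustrative choices.

```python
import numpy as np

def flat_topk(scores, k):
    # DSA-style flat indexer: scan all L token scores per query -> O(L) per query,
    # O(L^2) across a length-L prefix.
    idx = np.argpartition(scores, -k)[-k:]
    return np.sort(idx)

def hierarchical_topk(scores, k, block=64, keep_blocks=None):
    # Stage 1 (coarse): summarize each block by its max token score,
    # then keep only the highest-scoring blocks.
    if keep_blocks is None:
        keep_blocks = k  # keeping >= k blocks guarantees exact top-k under block-max scoring
    L = len(scores)
    nb = (L + block - 1) // block
    padded = np.full(nb * block, -np.inf)
    padded[:L] = scores
    blk_scores = padded.reshape(nb, block).max(axis=1)
    nkeep = min(nb, keep_blocks)
    top_blocks = np.argpartition(blk_scores, -nkeep)[-nkeep:]
    # Stage 2 (fine): token-level top-k restricted to the surviving blocks.
    cand = np.concatenate(
        [np.arange(b * block, min((b + 1) * block, L)) for b in top_blocks]
    )
    sel = cand[np.argpartition(scores[cand], -k)[-k:]]
    return np.sort(sel)
```

With `keep_blocks = k`, the fine stage scans at most `k * block` tokens instead of all `L`, which is where the asymptotic savings come from as context length grows; the two functions return identical index sets on scores without ties.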
From the abstract
Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an $O(L^2)$ per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchi…