HISA eliminates the quadratic O(L²) bottleneck in sparse attention indexers, enabling efficient long-context scaling for models like DeepSeek-V3.
March 31, 2026
Original Paper
HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
arXiv · 2603.28458
The Takeaway
HISA introduces a hierarchical search that replaces the indexer's flat token scan with a block-level coarse filter before fine-grained token selection, achieving up to 4x speedups at 128K context. It is a drop-in replacement that preserves the exact top-k sparsity pattern, so no fine-tuning is required.
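The coarse-to-fine idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the coarse stage scores each block by its maximum token score, which has the convenient property that keeping the top-k blocks provably covers every exact top-k token (any block whose max reaches the k-th largest score must itself contain a top-k token). The function names and the block size are illustrative choices.

```python
import numpy as np

def flat_topk(scores, k):
    # DSA-style flat indexer: scan all L token scores per query -> O(L) per query,
    # O(L^2) across a length-L prefix.
    idx = np.argpartition(scores, -k)[-k:]
    return np.sort(idx)

def hierarchical_topk(scores, k, block=64, keep_blocks=None):
    # Stage 1 (coarse): summarize each block by its max token score,
    # then keep only the highest-scoring blocks.
    if keep_blocks is None:
        keep_blocks = k  # keeping >= k blocks guarantees exact top-k under block-max scoring
    L = len(scores)
    nb = (L + block - 1) // block
    padded = np.full(nb * block, -np.inf)
    padded[:L] = scores
    blk_scores = padded.reshape(nb, block).max(axis=1)
    nkeep = min(nb, keep_blocks)
    top_blocks = np.argpartition(blk_scores, -nkeep)[-nkeep:]
    # Stage 2 (fine): token-level top-k restricted to the surviving blocks.
    cand = np.concatenate(
        [np.arange(b * block, min((b + 1) * block, L)) for b in top_blocks]
    )
    sel = cand[np.argpartition(scores[cand], -k)[-k:]]
    return np.sort(sel)
```

With `keep_blocks = k`, the fine stage scans at most `k * block` tokens instead of all `L`, which is where the asymptotic savings come from as context length grows; the two functions return identical index sets on scores without ties.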
From the abstract
Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an $O(L^2)$ per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchi…