Introduces custom CUDA kernels and a sparse packing format that enable Transformers to maintain performance with over 99% feedforward sparsity.
March 25, 2026
Original Paper
Sparser, Faster, Lighter Transformer Language Models
arXiv · 2603.23198
The Takeaway
The paper addresses the massive computational cost of LLMs by making unstructured sparsity practical on modern GPUs. This allows significantly higher throughput and lower energy usage without the usual 'sparsity tax' or accuracy trade-offs.
From the abstract
Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient […]
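The core idea — storing only the nonzero feedforward weights in a compact layout and multiplying against just those — can be sketched in plain Python. This is a minimal illustration using a CSR-style layout; the paper's actual packing format and CUDA kernels are not described here, so all names and structure below are assumptions:

```python
def pack_sparse(W):
    """Pack an unstructured-sparse weight matrix (a list of rows) into a
    CSR-style layout: flat nonzero values, their column indices, and
    per-row offsets. Illustrative only -- not the paper's actual format."""
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def sparse_matvec(values, col_idx, row_ptr, x):
    """Compute y = W @ x touching only the stored nonzeros, so the work
    scales with the nonzero count rather than the full matrix size."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A tiny, mostly-zero weight matrix standing in for a >99%-sparse
# feedforward layer.
W = [[0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 3.0]]
vals, cols, ptrs = pack_sparse(W)
print(sparse_matvec(vals, cols, ptrs, [1.0, 1.0, 1.0]))  # [2.0, 0.0, 4.0]
```

At 99% sparsity this stores and multiplies roughly 1% of the entries; the paper's contribution is making that same arithmetic saving survive contact with real GPU execution pipelines, which this scalar sketch does not attempt.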