AI & ML Efficiency Breakthrough

Introduces custom CUDA kernels and a sparse packing format that enable Transformers to maintain performance at over 99% feedforward sparsity.

March 25, 2026

Original Paper

Sparser, Faster, Lighter Transformer Language Models

Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones

arXiv · 2603.23198

The Takeaway

The paper addresses the massive computational cost of LLMs by making unstructured sparsity practical on modern GPUs. This enables significantly higher throughput and lower energy usage without the usual 'sparsity tax' of accuracy trade-offs.
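To see why 99% unstructured sparsity is attractive, here is a rough back-of-the-envelope sketch (not the paper's kernels): zeroing individual feedforward weights, with no block or N:M pattern, removes roughly the same fraction of multiply-accumulates. The layer sizes below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 1024, 4096           # illustrative feedforward dimensions
W = rng.standard_normal((d_ff, d_model))

# Unstructured sparsity: zero out ~99% of individual weights, with no
# block or N:M pattern constraining where the zeros land.
mask = rng.random(W.shape) < 0.01    # keep ~1% of entries
W_sparse = W * mask

dense_macs = W.size                  # one multiply-accumulate per weight
sparse_macs = int(np.count_nonzero(W_sparse))

print(f"dense MACs:  {dense_macs}")
print(f"sparse MACs: {sparse_macs} (~{sparse_macs / dense_macs:.1%} of dense)")
```

The catch, and the paper's focus, is that realizing this FLOP reduction on GPU hardware requires a packing format and kernels that dense-optimized pipelines can actually exploit.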

From the abstract

Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs […]
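The general idea of a sparse packing format can be sketched in plain Python with a CSR-style layout, storing only the nonzero weights and their column indices per row. This is a hypothetical illustration of sparse packing in general, not the paper's actual format or its CUDA kernels.

```python
import numpy as np

def pack_csr(W):
    """Pack a sparse matrix into CSR-style arrays: row pointers,
    column indices, and the nonzero values only."""
    indptr = [0]
    indices, values = [], []
    for row in W:
        nz = np.flatnonzero(row)
        indices.extend(nz)
        values.extend(row[nz])
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(values, dtype=W.dtype)

def sparse_matvec(indptr, indices, values, x):
    """Compute y = W @ x touching only the stored nonzeros."""
    y = np.zeros(len(indptr) - 1, dtype=x.dtype)
    for i in range(len(y)):
        s, e = indptr[i], indptr[i + 1]
        y[i] = values[s:e] @ x[indices[s:e]]
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)) * (rng.random((64, 128)) < 0.01)  # ~99% sparse
x = rng.standard_normal(128)
packed = pack_csr(W)
print(np.allclose(sparse_matvec(*packed, x), W @ x))  # matches the dense product
```

At ~99% sparsity the packed representation stores roughly 1% of the entries, which is what makes the memory and FLOP savings possible; the hard part the paper tackles is mapping such an irregular layout onto GPU execution pipelines efficiently.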