AI & ML Efficiency Breakthrough

Accelerates LLM inference by up to 1.8x using a training-free sparse pattern predictor based on SVD truncation of FFN gate matrices.

March 17, 2026

Original Paper

SVD Contextual Sparsity Predictors for Fast LLM Inference

Georgii Serbin, Kirill Koshkin, Zhongao Sun, Anastasiya Bistrigova, C.C. Korikov

arXiv · 2603.14110

The Takeaway

Existing contextual sparsity methods typically require training dedicated sparse-pattern predictors, which is expensive; this approach instead derives the predictor directly from a truncated SVD of the FFN gate weights, with no training at all. The result is a practical, drop-in speedup for ReGLU-based LLMs (e.g., ReLU-activated Llama/Mistral variants) on edge devices without sacrificing accuracy.
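The core idea can be sketched in a few lines of NumPy. Everything below (matrix names, the sign-thresholding rule, the toy sizes) is an illustrative assumption, not the paper's implementation: a rank-k truncated SVD of the gate matrix gives a cheap low-rank proxy for the gate pre-activations, whose signs predict which ReLU outputs will be nonzero, so the exact FFN is computed only over the predicted-active neurons.

```python
import numpy as np

# Toy sizes for illustration; real models use d ~ 4096, h ~ 11008.
rng = np.random.default_rng(0)
d, h = 64, 256  # model dim, FFN hidden dim

W_gate = rng.standard_normal((h, d)) / np.sqrt(d)
W_up = rng.standard_normal((h, d)) / np.sqrt(d)
W_down = rng.standard_normal((d, h)) / np.sqrt(h)

def svd_predictor(W, k):
    """Training-free predictor: rank-k factors (A, B) with W ~ A @ B."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k]

def reglu_sparse(x, A, B):
    # Low-rank scores stand in for the gate pre-activations:
    # O(k*(d+h)) work instead of the dense O(h*d).
    scores = A @ (B @ x)
    active = scores > 0  # predicted ReLU support
    # Exact FFN computation restricted to the predicted-active neurons.
    g = np.maximum(W_gate[active] @ x, 0.0)
    return W_down[:, active] @ (g * (W_up[active] @ x))

def reglu_dense(x):
    g = np.maximum(W_gate @ x, 0.0)
    return W_down @ (g * (W_up @ x))
```

At full rank k = min(d, h) the SVD is exact, so the predicted support matches the true ReLU support and the sparse path reproduces the dense FFN; shrinking k trades prediction accuracy for fewer FLOPs in the predictor.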

From the abstract

Contextual sparsity is one of the approaches used to reduce computational complexity in the inference process of large language models (LLMs). Existing techniques for efficient LLM inference acceleration based on contextual sparsity with minimal accuracy degradation require training sparse pattern predictors. This paper presents a framework for accelerating inference of ReGLU-based feed-forward networks (FFNs) within LLMs. The proposed framework provides a fast, training-free method for building