You can run 1B+ parameter models while only activating 5% of the weights, with zero loss in performance.
April 15, 2026
Original Paper
Dynamic sparsity in tree-structured feed-forward layers at scale
arXiv · 2604.08565
The Takeaway
Traditional LLMs are compute-heavy because every token touches every weight. This paper demonstrates that tree-structured conditional sparsity can scale beyond 1 billion parameters while matching dense model performance. Because each token activates fewer than 5% of the units, inference cost drops dramatically without sacrificing capability. This is a roadmap for moving massive AI capabilities onto edge devices and drastically reducing the power footprint of data centers. It proves that model size doesn't have to be a death sentence for inference latency.
From the abstract
At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied …
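To make the mechanism concrete, here is a minimal sketch of the idea the abstract describes: a tree-structured feed-forward layer where each internal node holds a routing hyperplane (no separate router network), a token descends to exactly one leaf via hard left/right decisions, and only that leaf's small MLP is applied. All dimensions, the tree depth, and the weight initialization below are illustrative assumptions, not the paper's actual configuration.

```python
import math
import random

random.seed(0)

D = 16           # model width (illustrative, not the paper's setting)
DEPTH = 5        # tree depth -> 2**DEPTH = 32 leaf experts
LEAF_HIDDEN = 8  # hidden width of each small leaf MLP

def rand_vec(n):
    return [random.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]

def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

# One routing hyperplane per internal node (2**DEPTH - 1 of them);
# the routing decision is part of the layer itself, not a separate router.
node_w = [rand_vec(D) for _ in range(2 ** DEPTH - 1)]

# One small MLP per leaf: D -> LEAF_HIDDEN -> D.
leaf_w1 = [rand_mat(LEAF_HIDDEN, D) for _ in range(2 ** DEPTH)]
leaf_w2 = [rand_mat(D, LEAF_HIDDEN) for _ in range(2 ** DEPTH)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tree_ffn(x):
    """Route x down the tree with hard decisions, then apply one leaf MLP."""
    node = 0  # root; heap-style indexing: children of i are 2i+1, 2i+2
    for _ in range(DEPTH):
        go_right = dot(node_w[node], x) > 0.0  # hard (discrete) routing
        node = 2 * node + (2 if go_right else 1)
    leaf = node - (2 ** DEPTH - 1)             # index among the leaves
    h = [max(0.0, dot(row, x)) for row in leaf_w1[leaf]]  # ReLU hidden layer
    return [dot(row, h) for row in leaf_w2[leaf]]

x = rand_vec(D)
y = tree_ffn(x)

# Parameter accounting: a token touches only DEPTH routing vectors
# plus one leaf MLP, out of the layer's full parameter count.
total = (2 ** DEPTH - 1) * D + 2 ** DEPTH * 2 * LEAF_HIDDEN * D
active = DEPTH * D + 2 * LEAF_HIDDEN * D
print(f"active fraction: {active / total:.1%}")
```

With these toy sizes the active fraction comes out under 5%, which mirrors the headline claim: the savings grow with tree depth, since a token's cost scales with the depth plus one leaf, while total parameters scale with the number of leaves.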