AI & ML Efficiency Breakthrough

MoE-Sieve reduces Mixture-of-Experts LoRA fine-tuning parameters and training time by ~70% by only adapting the most-frequently activated 'hot' experts.

March 26, 2026

Original Paper

MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning

Andrea Manzoni

arXiv · 2603.24044

The Takeaway

MoE-Sieve challenges the standard practice of applying LoRA adapters to every expert in an MoE model, showing that adapting rarely activated 'cold' experts is often counterproductive and mainly adds gradient noise. This significantly lowers the hardware barrier for fine-tuning massive MoE models such as Mixtral or DeepSeek without sacrificing performance.

From the abstract

Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated ("cold"). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a
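The profiling step the abstract describes can be sketched in a few lines: count how often each expert is routed to, then keep the smallest 'hot' set covering most token routings. This is a minimal NumPy sketch under stated assumptions; the function name, the coverage threshold, and the synthetic skewed routing distribution are illustrative, not taken from the paper.

```python
import numpy as np

def profile_hot_experts(routing_ids, num_experts, coverage=0.9):
    """Count expert selections and return the smallest set of 'hot'
    experts that together handle at least `coverage` of all tokens."""
    counts = np.bincount(routing_ids, minlength=num_experts)
    order = np.argsort(counts)[::-1]                 # experts by descending load
    cum = np.cumsum(counts[order]) / counts.sum()    # cumulative token share
    k = int(np.searchsorted(cum, coverage)) + 1      # experts needed for coverage
    return order[:k].tolist(), counts

# Synthetic skewed routing (an assumption for illustration):
# experts 0 and 1 handle most of the 10,000 tokens.
rng = np.random.default_rng(0)
ids = rng.choice(8, size=10_000,
                 p=[.4, .3, .1, .07, .05, .04, .03, .01])
hot, counts = profile_hot_experts(ids, num_experts=8, coverage=0.9)
# LoRA adapters would then be attached only to the experts in `hot`,
# leaving the cold experts frozen.
```

In a real run the `routing_ids` would come from logging the router's top-k selections on a small calibration set, per layer.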