Accelerates MoE inference by speculating future experts to overlap CPU-GPU memory transfers with computation.
March 23, 2026
Original Paper
Speculating Experts Accelerates Inference for Mixture-of-Experts
arXiv · 2603.19289
The Takeaway
For memory-constrained environments where expert weights must be offloaded to CPU, this prefetching scheme reduces time per output token by 14%. This is highly relevant for local deployment of MoE LLMs (e.g., Mixtral) on consumer hardware with limited VRAM.
From the abstract
Mixture-of-Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per-token compute. However, in memory-constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU-GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation.
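The core idea, guessing the next layer's experts from the current hidden state so their CPU-to-GPU copies can start while the current layer computes, can be sketched as a toy simulation. Everything below (the scalar "weights", the toy `router`, `ExpertCache`, and the decode loop) is illustrative and not from the paper; real implementations would speculate with the actual gating networks and measure transfer latency rather than count cache misses:

```python
# Toy sketch of speculative expert prefetching for offloaded MoE decoding.
# All names and numbers here are hypothetical, chosen only to illustrate the
# overlap of "transfers" (dict copies) with "compute" (hidden-state updates).
from concurrent.futures import ThreadPoolExecutor

NUM_EXPERTS, TOP_K, NUM_LAYERS = 8, 2, 4

# "CPU-side" expert weights: a scalar stands in for each expert's tensors.
cpu_experts = {e: float(e + 1) for e in range(NUM_EXPERTS)}

def router(hidden, layer):
    """Toy gating function: deterministic top-k expert ids for a layer."""
    scores = [((hidden + layer) * (e + 3)) % 7.0 for e in range(NUM_EXPERTS)]
    return sorted(range(NUM_EXPERTS), key=lambda e: -scores[e])[:TOP_K]

class ExpertCache:
    """GPU-resident expert cache; misses model blocking CPU->GPU transfers."""
    def __init__(self):
        self.gpu = {}
        self.critical_path_transfers = 0

    def prefetch(self, expert_ids):
        # Runs off the critical path, overlapped with layer compute.
        for e in expert_ids:
            self.gpu.setdefault(e, cpu_experts[e])

    def fetch(self, e):
        # A miss here is a transfer the decoder must stall on.
        if e not in self.gpu:
            self.critical_path_transfers += 1
            self.gpu[e] = cpu_experts[e]
        return self.gpu[e]

def decode_token(hidden, speculate=True):
    """Decode one token through all layers; return stalled-transfer count."""
    cache = ExpertCache()
    with ThreadPoolExecutor(max_workers=1) as pool:
        for layer in range(NUM_LAYERS):
            job = None
            if speculate and layer + 1 < NUM_LAYERS:
                # Speculate the NEXT layer's experts from the CURRENT hidden
                # state and start copying them while this layer computes.
                guess = router(hidden, layer + 1)
                job = pool.submit(cache.prefetch, guess)
            for e in router(hidden, layer):
                hidden += 0.01 * cache.fetch(e)  # stand-in for expert compute
            if job is not None:
                job.result()  # prefetch finishes within the layer's compute
    return cache.critical_path_transfers

baseline = decode_token(1.0, speculate=False)
speculative = decode_token(1.0, speculate=True)
print(f"stalled transfers: baseline={baseline}, speculative={speculative}")
```

Because the hidden state drifts only slightly between layers in this toy, the speculated expert set usually matches the one the router actually picks one layer later, so most transfers complete off the critical path; the paper's reported 14% time-per-output-token reduction reflects the same effect at real model scale, where speculation accuracy is imperfect.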