Accelerates MoE inference by speculating future experts to overlap CPU-GPU memory transfers with computation.
March 23, 2026
Original Paper
Speculating Experts Accelerates Inference for Mixture-of-Experts
arXiv · 2603.19289
The Takeaway
For memory-constrained environments where expert weights must be offloaded to CPU, this prefetching scheme reduces time per output token by 14%. This is highly relevant for local deployment of MoE LLMs (e.g., Mixtral) on consumer hardware with limited VRAM.
From the abstract
Mixture-of-Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per-token compute. However, in memory-constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU-GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation.
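The core idea, guessing the next layer's experts from the current hidden state so their CPU-to-GPU copies can start while the current layer computes, can be sketched as a toy simulation. Everything below (the scalar "weights", the toy `router`, `ExpertCache`, and the decode loop) is illustrative and not from the paper; real implementations would speculate with the actual gating networks and measure transfer latency rather than count cache misses:

```python
# Toy sketch of speculative expert prefetching for offloaded MoE decoding.
# All names and numbers here are hypothetical, chosen only to illustrate the
# overlap of "transfers" (dict copies) with "compute" (hidden-state updates).
from concurrent.futures import ThreadPoolExecutor

NUM_EXPERTS, TOP_K, NUM_LAYERS = 8, 2, 4

# "CPU-side" expert weights: a scalar stands in for each expert's tensors.
cpu_experts = {e: float(e + 1) for e in range(NUM_EXPERTS)}

def router(hidden, layer):
    """Toy gating function: deterministic top-k expert ids for a layer."""
    scores = [((hidden + layer) * (e + 3)) % 7.0 for e in range(NUM_EXPERTS)]
    return sorted(range(NUM_EXPERTS), key=lambda e: -scores[e])[:TOP_K]

class ExpertCache:
    """GPU-resident expert cache; misses model blocking CPU->GPU transfers."""
    def __init__(self):
        self.gpu = {}
        self.critical_path_transfers = 0

    def prefetch(self, expert_ids):
        # Runs off the critical path, overlapped with layer compute.
        for e in expert_ids:
            self.gpu.setdefault(e, cpu_experts[e])

    def fetch(self, e):
        # A miss here is a transfer the decoder must stall on.
        if e not in self.gpu:
            self.critical_path_transfers += 1
            self.gpu[e] = cpu_experts[e]
        return self.gpu[e]

def decode_token(hidden, speculate=True):
    """Decode one token through all layers; return stalled-transfer count."""
    cache = ExpertCache()
    with ThreadPoolExecutor(max_workers=1) as pool:
        for layer in range(NUM_LAYERS):
            job = None
            if speculate and layer + 1 < NUM_LAYERS:
                # Speculate the NEXT layer's experts from the CURRENT hidden
                # state and start copying them while this layer computes.
                guess = router(hidden, layer + 1)
                job = pool.submit(cache.prefetch, guess)
            for e in router(hidden, layer):
                hidden += 0.01 * cache.fetch(e)  # stand-in for expert compute
            if job is not None:
                job.result()  # prefetch finishes within the layer's compute
    return cache.critical_path_transfers

baseline = decode_token(1.0, speculate=False)
speculative = decode_token(1.0, speculate=True)
print(f"stalled transfers: baseline={baseline}, speculative={speculative}")
```

Because the hidden state drifts only slightly between layers in this toy, the speculated expert set usually matches the one the router actually picks one layer later, so most transfers complete off the critical path; the paper's reported 14% time-per-output-token reduction reflects the same effect at real model scale, where speculation accuracy is imperfect.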