AI & ML Practical Magic

A new controller reduces the frequency of AI brain-switching by 90% while keeping almost all of the model's intelligence intact.

April 25, 2026

Original Paper

Temporally Extended Mixture-of-Experts Models

arXiv · 2604.20156

The Takeaway

Mixture-of-Experts models usually swap between internal specialists at nearly every token they process, and once the model no longer fits in GPU memory, that churn defeats optimizations like offloading and pre-fetching. This paper's approach instead commits the model to a chosen expert for longer stretches of time, cutting the per-token switching rate from about 50% down to just 5%. The prevailing assumption had been that constant switching was necessary for the high accuracy these large models are known for. With the switching overhead reduced, giant sparse models can run on much smaller hardware without the heavy latency penalties typically seen when experts must be streamed in from slower memory. That puts the most powerful AI models within reach of companies that cannot afford racks of high-end server GPUs.
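The core idea can be illustrated with a toy router. The sketch below is not the paper's method (which builds on option-critic with deliberation costs); it is a minimal stand-in, with made-up names, showing how adding a "stickiness" margin to top-1 routing trades a little score optimality for far fewer expert switches:

```python
import numpy as np

def route_tokens(logits, switch_margin=0.0):
    """Top-1 expert routing over a token sequence.

    With switch_margin > 0, the router stays on its current expert
    unless a rival's score beats it by at least the margin -- a toy
    stand-in for temporally extended routing. All names here are
    illustrative, not the paper's API.
    """
    current = None
    choices = []
    for scores in logits:
        best = int(np.argmax(scores))
        if current is None or scores[best] - scores[current] > switch_margin:
            current = best
        choices.append(current)
    return choices

def switch_rate(choices):
    """Fraction of consecutive tokens where the active expert changes."""
    return sum(a != b for a, b in zip(choices, choices[1:])) / max(len(choices) - 1, 1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 8))  # 1000 tokens, 8 experts

vanilla = switch_rate(route_tokens(logits, switch_margin=0.0))
sticky = switch_rate(route_tokens(logits, switch_margin=3.0))
print(f"per-token routing switch rate: {vanilla:.2f}")
print(f"sticky routing switch rate:    {sticky:.2f}")
```

With independent random scores, vanilla top-1 routing switches on most tokens, while the margin makes switches rare; the interesting part of the actual paper is learning *when* switching is worth its cost, rather than hard-coding a threshold.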

From the abstract

Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a co