A self-distillation method for Multi-Token Prediction (MTP) that yields a 220% inference speedup with minimal training cost.
March 26, 2026
Original Paper
Self-Distillation for Multi-Token Prediction
arXiv · 2603.23911
The Takeaway
Multi-token prediction is a key strategy for faster LLM inference, but training multiple heads is notoriously difficult. This self-distillation approach makes MTP heads significantly more accurate and easier to scale, providing a practical path to doubling inference throughput on existing hardware.
From the abstract
As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head ac…
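To make the self-distillation idea concrete, here is a toy NumPy sketch of the general pattern the abstract describes: an MTP head that predicts token t+k directly from the hidden state at step t is trained to match the base model's own autoregressive distribution for that position (the "teacher"), rather than one-hot ground-truth labels. The random logits, the plain KL objective, and all variable names are illustrative assumptions on my part, not the paper's exact formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL(p || q) between two categorical distributions
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
vocab = 16

# Teacher: the base model's next-token distribution at position t+k,
# obtained by running the model autoregressively (stand-in logits here).
teacher_logits = rng.normal(size=vocab)

# Student: an MTP head predicting token t+k in one shot from the
# hidden state at position t (again, stand-in logits).
head_logits = rng.normal(size=vocab)

p_teacher = softmax(teacher_logits)
p_head = softmax(head_logits)

# Self-distillation loss: pull the MTP head toward the model's own
# soft distribution instead of hard labels. Minimizing this across
# positions is what would raise the head's acceptance rate at decode time.
loss = kl(p_teacher, p_head)
```

In an actual training loop this loss would be backpropagated only through `head_logits` (the teacher is frozen or detached), which is what keeps the additional training cost small: no new labels or external teacher model are needed.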