A self-distillation method for Multi-Token Prediction (MTP) that yields a 220% inference speedup with minimal training cost.
March 26, 2026
Original Paper
Self-Distillation for Multi-Token Prediction
arXiv · 2603.23911
The Takeaway
Multi-token prediction is a key strategy for faster LLM inference, but training multiple heads is notoriously difficult. This self-distillation approach makes MTP heads significantly more accurate and easier to scale, providing a practical path to doubling inference throughput on existing hardware.
From the abstract
As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head ac…
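To make the self-distillation idea concrete, here is a toy NumPy sketch of the general pattern the abstract describes: an MTP head that predicts token t+k directly from the hidden state at step t is trained to match the base model's own autoregressive distribution for that position (the "teacher"), rather than one-hot ground-truth labels. The random logits, the plain KL objective, and all variable names are illustrative assumptions on my part, not the paper's exact formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL(p || q) between two categorical distributions
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
vocab = 16

# Teacher: the base model's next-token distribution at position t+k,
# obtained by running the model autoregressively (stand-in logits here).
teacher_logits = rng.normal(size=vocab)

# Student: an MTP head predicting token t+k in one shot from the
# hidden state at position t (again, stand-in logits).
head_logits = rng.normal(size=vocab)

p_teacher = softmax(teacher_logits)
p_head = softmax(head_logits)

# Self-distillation loss: pull the MTP head toward the model's own
# soft distribution instead of hard labels. Minimizing this across
# positions is what would raise the head's acceptance rate at decode time.
loss = kl(p_teacher, p_head)
```

In an actual training loop this loss would be backpropagated only through `head_logits` (the teacher is frozen or detached), which is what keeps the additional training cost small: no new labels or external teacher model are needed.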