AI & ML Nature Is Weird

Switching to multi-token prediction forces Transformers to stop guessing the next word and start planning their reasoning backward from the goal.

April 17, 2026

Original Paper

How Transformers Learn to Plan via Multi-Token Prediction

arXiv · 2604.11912

The Takeaway

We used to treat next-token prediction as the only way to train language models, but it tends to produce 'myopic' models that struggle to plan ahead. This work shows that multi-token prediction induces a two-stage process: the model first identifies the goal, then traces a path backward from it to the start. That shifts the model's internal computation from linear, one-step-at-a-time extrapolation to structured, goal-oriented planning, and it points toward LLMs that tackle complex multi-step problems by locating the finish line before taking the first step. In effect, the training objective moves from simple imitation toward active problem-solving.
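The "identify the goal, then trace backward" behavior described above can be made concrete with a toy planner. This is an illustrative sketch, not code from the paper: it searches a directed graph backward from the goal (over reversed edges) and then reads the path off in forward order, mirroring the two-stage process the authors attribute to MTP-trained models. The function name and graph encoding are assumptions made here for illustration.

```python
from collections import deque

def plan_backward(edges, start, goal):
    """Toy backward planner: search from the goal toward the start,
    then emit the path in forward (start -> goal) order."""
    # Build a reverse adjacency map so BFS can run from the goal.
    rev = {}
    for a, b in edges:
        rev.setdefault(b, []).append(a)
    # Stage 1: fan out from the goal over reversed edges.
    came_from = {goal: None}  # maps node -> its successor toward the goal
    frontier = deque([goal])
    while frontier:
        node = frontier.popleft()
        if node == start:
            # Stage 2: follow successor links to read the path forward.
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path
        for prev in rev.get(node, []):
            if prev not in came_from:
                came_from[prev] = node
                frontier.append(prev)
    return None  # goal unreachable from start
```

On the diamond graph A→B→C, A→D→C, `plan_backward(edges, "A", "C")` recovers a shortest start-to-goal path even though the search itself never steps forward from the start, which is the intuition behind goal-first planning.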

From the abstract

While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic r
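To make the NTP/MTP contrast in the abstract concrete, here is a minimal sketch of a multi-token prediction loss: at each position the model is scored against the next k tokens rather than only the next one, with NTP recovered as the special case k = 1. The shapes, names, and averaging scheme are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def mtp_loss(logits, tokens, k):
    """Toy multi-token prediction loss.

    logits: (T, k, V) array — at each position t, head j scores token t+1+j.
    tokens: length-T sequence of target token ids.
    Returns mean cross-entropy over all in-range (position, head) pairs.
    """
    T, heads, V = logits.shape
    assert heads == k
    total, count = 0.0, 0
    for t in range(T):
        for j in range(k):
            target = t + 1 + j
            if target >= T:
                continue  # no supervision past the end of the sequence
            # Numerically stable log-softmax over the vocabulary.
            z = logits[t, j] - logits[t, j].max()
            logp = z - np.log(np.exp(z).sum())
            total -= logp[tokens[target]]
            count += 1
    return total / count
```

With uniform (all-zero) logits the loss is log V for any k, as expected for a model with no information; the training signal from larger k comes from supervising each position on several future tokens at once.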