Self-Distillation for Multi-Token Prediction

This paper proposes MTP-D, a self-distillation method with a looped extension strategy that significantly improves multi-token prediction acceptance rates and inference speed while preserving main-head performance in Large Language Models.

Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun

Published 2026-03-26

Imagine you are trying to write a long story, but you have a strict rule: you can only write one word at a time. After you write "The," you have to stop, think, and then write "cat." Then stop, think, and write "sat."

This is how most current Large Language Models (LLMs) work. It's accurate, but it's slow. It's like driving a car where you have to come to a complete stop at every single word before you can move on to the next one.

The Problem: The "One-Word-at-a-Time" Bottleneck

To make these models faster, researchers invented Multi-Token Prediction (MTP). Think of this as giving the model a superpower: instead of just guessing the next word, it tries to guess the next three words all at once.

  • Old Way: "The" → (stop) → "cat" → (stop) → "sat"
  • MTP Way: "The" → (guesses) → "cat sat on" (all at once!)

If the model guesses correctly, you save a massive amount of time. But here's the catch: the model isn't very good at guessing the 2nd, 3rd, or 4th words. It's great at the first one, but by the time it gets to the third word, it starts hallucinating or making mistakes. When it makes a mistake, the system has to stop, throw away the bad guess, and fall back to the slow, one-word-at-a-time method. This ruins the speedup.
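This "check the guesses, keep the good ones, throw away the rest" loop can be sketched in a few lines of plain Python. This is a toy illustration of the general accept/reject idea, not the paper's code; the function names and interfaces are invented:

```python
def accept_draft(drafted, verify_next, context):
    """Keep the longest prefix of drafted tokens the main model agrees with.

    drafted:     tokens proposed ahead of time (e.g. by MTP heads)
    verify_next: the main model's own next-token choice given a context
    context:     tokens generated so far
    """
    accepted = []
    ctx = list(context)
    for tok in drafted:
        if verify_next(ctx) == tok:   # main model agrees → keep the guess
            accepted.append(tok)
            ctx.append(tok)
        else:
            break                     # mismatch → fall back to one word at a time
    return accepted
```

The more often the drafted tokens match what the main model would have written anyway, the longer the accepted prefix, and the bigger the speedup; that acceptance rate is exactly what MTP-D tries to raise.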

The Solution: MTP-D (The "Self-Teaching" Trick)

The authors of this paper propose a clever solution called MTP-D. They realized that the model's "main brain" (the part that writes one word at a time) is actually very smart, but the "extra heads" (the parts trying to guess multiple words) are a bit clumsy.

So, they used a technique called Self-Distillation.

The Analogy: The Master Chef and the Sous Chefs
Imagine a Master Chef (the Main Head) who knows exactly how to cook a perfect dish. You also have a team of Sous Chefs (the MTP Heads) who are trying to predict the next steps of the recipe.

  • Before: The Sous Chefs were guessing blindly. They often got the ingredients wrong, so the Master Chef had to stop them and fix the dish.
  • After (MTP-D): The Master Chef whispers the top 10,000 most likely ingredients to the Sous Chefs before they start guessing. The Sous Chefs don't just guess randomly; they look at the Master's list and say, "Okay, the Master thinks it's likely to be 'salt' or 'pepper,' so I'll focus my energy there."

This is the TopN-Logits part. Instead of trying to guess from a dictionary of 100,000 words, the model only focuses on the top 10,000 words the main brain thinks are likely. This makes the guesses much more accurate.
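The TopN-Logits idea can be sketched as a distillation loss computed only over the main head's top-N vocabulary entries. This is a toy plain-Python sketch of the general technique, not the paper's implementation, and N is tiny here just to keep the example readable:

```python
import math

def topn_distill_loss(teacher_logits, student_logits, n):
    """Toy TopN-logits distillation: match the student to the teacher,
    but only on the teacher's n highest-scoring vocabulary entries."""
    # indices of the teacher's top-n entries (the "Master's shortlist")
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:n]

    def softmax_on(logits, idx):
        # softmax restricted to the shortlisted entries
        m = max(logits[i] for i in idx)
        exps = {i: math.exp(logits[i] - m) for i in idx}
        z = sum(exps.values())
        return {i: e / z for i, e in exps.items()}

    p = softmax_on(teacher_logits, top)   # teacher distribution (the target)
    q = softmax_on(student_logits, top)   # student (MTP head) distribution
    # KL(p || q): zero when the student matches the teacher on the top-n set
    return sum(p[i] * math.log(p[i] / q[i]) for i in top)
```

Restricting the loss to the shortlist means the MTP heads spend their capacity on the words that actually matter, instead of spreading it across the full 100,000-word vocabulary.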

The "Stop-Gradient" Safety Net
There was a fear: "If we make the Sous Chefs copy the Master so hard, will the Master forget how to cook on their own?"
The authors solved this with a Stop-Gradient trick. Imagine the Master Chef is wearing noise-canceling headphones. The Sous Chefs can listen to the Master to learn, but the Master cannot hear the Sous Chefs' mistakes. This ensures the Master Chef stays perfect while the Sous Chefs get better.

The "Looped Extension": Stretching the Superpower

Once the Sous Chefs got really good at guessing the next 3 or 4 words, the researchers asked: "What if we add even more Sous Chefs? Can we guess 8 or 16 words ahead?"

Usually, adding more guessers makes the system chaotic and slow. But the authors found a way to loop the training.

  • The Analogy: Imagine you have a team of 4 runners who are trained to run a relay race. Instead of training 8 new runners from scratch, you take the first 4 runners, copy their training style, and use them to teach the next 4 runners. Then you run the whole 8-person team together.
  • Because the first group was already trained using the "Master Chef" method, they pass down that knowledge perfectly to the new group. This allows the model to scale up to 16 future words without breaking.
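The relay-team trick boils down to an initialization strategy: rather than training the deeper heads from scratch, seed them with copies of the already-distilled heads. A toy sketch (the head objects and structure here are invented for illustration, not the paper's code):

```python
import copy

def loop_extend(trained_heads, target_depth):
    """Extend a trained set of MTP heads to a deeper horizon by looping:
    each new depth position starts as a copy of an already-trained head."""
    heads = list(trained_heads)
    while len(heads) < target_depth:
        # cycle through the trained heads to seed each new position
        src = trained_heads[len(heads) % len(trained_heads)]
        heads.append(copy.deepcopy(src))   # copy so each head can fine-tune independently
    return heads
```

Because every new head starts from weights that already learned the "Master Chef's" shortlist, the deeper positions begin training close to a good solution instead of from random noise.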

The Results: Why Should We Care?

The paper shows that this method is a game-changer:

  1. More Accurate Guessing: The "Sous Chefs'" (MTP heads') guesses are now accepted about 7.5% more often than before.
  2. Massive Speedup: Because more guesses are accepted, the model achieves a roughly 2.2× speedup (more than double the speed) when predicting up to 16 words ahead.
  3. No Loss in Quality: The "Master Chef" (the main model) didn't get any dumber. It still writes perfect stories; it just does it much faster.

In a Nutshell

This paper teaches AI models to look further into the future without getting lost. By having the smartest part of the model gently guide the "guessing" parts, and by cleverly copying that training to add more guessers, we can make AI chatbots and writers twice as fast without sacrificing their intelligence. It's like upgrading from a car that stops at every word to a high-speed train that glides through the text.
