Self-Distillation for Multi-Token Prediction

This paper proposes MTP-D, a self-distillation method with a looped extension strategy that significantly improves multi-token prediction acceptance rates and inference speed while preserving main-head performance in Large Language Models.

Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun

Published 2026-03-26

Imagine you are trying to write a long story, but you have a strict rule: you can only write one word at a time. After you write "The," you have to stop, think, and then write "cat." Then stop, think, and write "sat."

This is how most current Large Language Models (LLMs) work. It's accurate, but it's slow. It's like driving a car where you have to come to a complete stop at every single word before you can move on to the next one.

The Problem: The "One-Word-at-a-Time" Bottleneck

To make these models faster, researchers invented Multi-Token Prediction (MTP). Think of this as giving the model a superpower: instead of just guessing the next word, it tries to guess the next three words all at once.

  • Old Way: "The" → (stop) → "cat" → (stop) → "sat"
  • MTP Way: "The" → (guesses) → "cat sat on" (all at once!)

If the model guesses correctly, you save a massive amount of time. But here's the catch: the model isn't very good at guessing the 2nd, 3rd, or 4th words. It's great at the first one, but by the time it gets to the third word, it starts hallucinating or making mistakes. When it makes a mistake, the system has to stop, throw away the bad guess, and fall back to the slow, one-word-at-a-time method. This ruins the speedup.
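This "check the guesses, keep the good ones, throw away the rest" loop can be sketched in a few lines of plain Python. This is a toy illustration of the general accept/reject idea, not the paper's code; the function names and interfaces are invented:

```python
def accept_draft(drafted, verify_next, context):
    """Keep the longest prefix of drafted tokens the main model agrees with.

    drafted:     tokens proposed ahead of time (e.g. by MTP heads)
    verify_next: the main model's own next-token choice given a context
    context:     tokens generated so far
    """
    accepted = []
    ctx = list(context)
    for tok in drafted:
        if verify_next(ctx) == tok:   # main model agrees → keep the guess
            accepted.append(tok)
            ctx.append(tok)
        else:
            break                     # mismatch → fall back to one word at a time
    return accepted
```

The more often the drafted tokens match what the main model would have written anyway, the longer the accepted prefix, and the bigger the speedup; that acceptance rate is exactly what MTP-D tries to raise.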

The Solution: MTP-D (The "Self-Teaching" Trick)

The authors of this paper propose a clever solution called MTP-D. They realized that the model's "main brain" (the part that writes one word at a time) is actually very smart, but the "extra heads" (the parts trying to guess multiple words) are a bit clumsy.

So, they used a technique called Self-Distillation.

The Analogy: The Master Chef and the Sous Chefs
Imagine a Master Chef (the Main Head) who knows exactly how to cook a perfect dish. You also have a team of Sous Chefs (the MTP Heads) who are trying to predict the next steps of the recipe.

  • Before: The Sous Chefs were guessing blindly. They often got the ingredients wrong, so the Master Chef had to stop them and fix the dish.
  • After (MTP-D): The Master Chef whispers the top 10,000 most likely ingredients to the Sous Chefs before they start guessing. The Sous Chefs don't just guess randomly; they look at the Master's list and say, "Okay, the Master thinks it's likely to be 'salt' or 'pepper,' so I'll focus my energy there."

This is the TopN-Logits part. Instead of trying to guess from a dictionary of 100,000 words, the model only focuses on the top 10,000 words the main brain thinks are likely. This makes the guesses much more accurate.
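The TopN-Logits idea can be sketched as a distillation loss computed only over the main head's top-N vocabulary entries. This is a toy plain-Python sketch of the general technique, not the paper's implementation, and N is tiny here just to keep the example readable:

```python
import math

def topn_distill_loss(teacher_logits, student_logits, n):
    """Toy TopN-logits distillation: match the student to the teacher,
    but only on the teacher's n highest-scoring vocabulary entries."""
    # indices of the teacher's top-n entries (the "Master's shortlist")
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:n]

    def softmax_on(logits, idx):
        # softmax restricted to the shortlisted entries
        m = max(logits[i] for i in idx)
        exps = {i: math.exp(logits[i] - m) for i in idx}
        z = sum(exps.values())
        return {i: e / z for i, e in exps.items()}

    p = softmax_on(teacher_logits, top)   # teacher distribution (the target)
    q = softmax_on(student_logits, top)   # student (MTP head) distribution
    # KL(p || q): zero when the student matches the teacher on the top-n set
    return sum(p[i] * math.log(p[i] / q[i]) for i in top)
```

Restricting the loss to the shortlist means the MTP heads spend their capacity on the words that actually matter, instead of spreading it across the full 100,000-word vocabulary.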

The "Stop-Gradient" Safety Net
There was a fear: "If we make the Sous Chefs copy the Master so hard, will the Master forget how to cook on their own?"
The authors solved this with a Stop-Gradient trick. Imagine the Master Chef is wearing noise-canceling headphones. The Sous Chefs can listen to the Master to learn, but the Master cannot hear the Sous Chefs' mistakes. This ensures the Master Chef stays perfect while the Sous Chefs get better.

The "Looped Extension": Stretching the Superpower

Once the Sous Chefs got really good at guessing the next 3 or 4 words, the researchers asked: "What if we add even more Sous Chefs? Can we guess 8 or 16 words ahead?"

Usually, adding more guessers makes the system chaotic and slow. But the authors found a way to loop the training.

  • The Analogy: Imagine you have a team of 4 runners who are trained to run a relay race. Instead of training 8 new runners from scratch, you take the first 4 runners, copy their training style, and use them to teach the next 4 runners. Then you run the whole 8-person team together.
  • Because the first group was already trained using the "Master Chef" method, they pass down that knowledge perfectly to the new group. This allows the model to scale up to 16 future words without breaking.
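The relay-team trick boils down to an initialization strategy: rather than training the deeper heads from scratch, seed them with copies of the already-distilled heads. A toy sketch (the head objects and structure here are invented for illustration, not the paper's code):

```python
import copy

def loop_extend(trained_heads, target_depth):
    """Extend a trained set of MTP heads to a deeper horizon by looping:
    each new depth position starts as a copy of an already-trained head."""
    heads = list(trained_heads)
    while len(heads) < target_depth:
        # cycle through the trained heads to seed each new position
        src = trained_heads[len(heads) % len(trained_heads)]
        heads.append(copy.deepcopy(src))   # copy so each head can fine-tune independently
    return heads
```

Because every new head starts from weights that already learned the "Master Chef's" shortlist, the deeper positions begin training close to a good solution instead of from random noise.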

The Results: Why Should We Care?

The paper shows that this method is a game-changer:

  1. More Accurate Guessing: The "Sous Chefs'" (MTP heads') guesses are now accepted about 7.5% more often than before.
  2. Massive Speedup: Because more guesses are accepted, the model achieves a roughly 2.2× speedup (more than double the speed) when predicting up to 16 words ahead.
  3. No Loss in Quality: The "Master Chef" (the main model) didn't get any dumber. It still writes perfect stories; it just does it much faster.

In a Nutshell

This paper teaches AI models to look further into the future without getting lost. By having the smartest part of the model gently guide the "guessing" parts, and by cleverly copying that training to add more guessers, we can make AI chatbots and writers twice as fast without sacrificing their intelligence. It's like upgrading from a car that stops at every word to a high-speed train that glides through the text.
