Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

This paper identifies key failure modes in on-policy distillation for large language models, such as unstable gradient variance and unreliable teacher guidance. As a fix, it proposes teacher top-K local support matching with a truncated reverse-KL objective and special-token masking, which yields more stable optimization and improved downstream performance.

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao

Published 2026-03-27

  • The Setup: The teacher's tokenizer splits a tag like `<thinking>` into several pieces: `<`, `think`, `ing`, `>`. The student's system sees it as one token: `<thinking>`.

  • The Result: The teacher gives the student a bad grade for the first chunk because it doesn't match the teacher's dictionary, even though the meaning is perfect. It's like a teacher failing a student for spelling "color" as "colour": both are correct, just different dialects.
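The mismatch above can be shown in a few lines. This is a toy illustration (the token lists are hypothetical, not taken from any real tokenizer): both sides encode the same string, but token-by-token comparison still fails because the segmentations differ.

```python
# Two hypothetical tokenizers segmenting the same text.
teacher_tokens = ["<", "think", "ing", ">"]  # teacher splits the tag into pieces
student_tokens = ["<thinking>"]              # student treats it as one token

# Both encodings spell out the identical string...
same_string = "".join(teacher_tokens) == "".join(student_tokens)

# ...but a naive token-by-token comparison cannot line them up.
aligned = len(teacher_tokens) == len(student_tokens)

print(same_string, aligned)  # same text, different segmentation
```

Because the strings match but the tokens don't, any per-token grading scheme will penalize the student here for a "spelling" difference, not a real mistake.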

The Solution: "The Safety Net" (Local Support Matching)

The authors propose a simple fix called Teacher Top-K Local Support Matching.

The Analogy: Instead of judging only the one word the student just wrote, the teacher looks at the top 50 words that the teacher itself would most likely write next.

  1. The Teacher's List: The teacher says, "If I were writing this, I would probably pick one of these 50 words."
  2. The Comparison: The student is compared against this whole list, not just the single word they happened to pick.
  3. The Result:
    • If the student picks a weird word that isn't on the teacher's list, they get a gentle correction.
    • If the student picks a word that is on the list, they get a reward.
    • Crucially: This stops the student from gaming the system by picking random "lucky" words. It forces them to stay within the "safe zone" of what a good answer looks like, without needing to be perfect on every single step.
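The three steps above can be sketched as a tiny loss function. This is our own toy rendering, not the paper's implementation: the distributions are plain Python dicts, `k` and the probability `floor` are assumed hyperparameters, and the objective is a reverse KL in which the teacher is truncated to its top-k tokens, so mass the student places outside that "safe zone" becomes a bounded penalty rather than an undefined log-ratio.

```python
import math

def truncated_reverse_kl(student_probs, teacher_probs, k=3, floor=1e-8):
    """Toy truncated reverse-KL: sum over student tokens of q * log(q / p),
    where the teacher distribution p is restricted to its top-k support."""
    # The teacher's top-k tokens form the local "safe zone" (the list).
    support = set(sorted(teacher_probs, key=teacher_probs.get, reverse=True)[:k])
    loss = 0.0
    for tok, q in student_probs.items():
        if q <= 0.0:
            continue
        # Outside the support, the teacher probability is floored, turning
        # the term into a gentle out-of-support correction.
        p = teacher_probs.get(tok, 0.0) if tok in support else 0.0
        p = max(p, floor)
        loss += q * math.log(q / p)  # reverse-KL term
    return loss

# A student that stays on the teacher's list gets a small loss;
# one that picks a word off the list gets a large one.
teacher = {"the": 0.6, "a": 0.3, "one": 0.08, "zebra": 0.02}
on_support = {"the": 0.6, "a": 0.3, "one": 0.1}
off_support = {"zebra": 1.0}
loss_in = truncated_reverse_kl(on_support, teacher)
loss_out = truncated_reverse_kl(off_support, teacher)
```

Note how the floor implements the "gentle correction": the penalty for leaving the safe zone is large but finite, so one unlucky token cannot blow up the gradient the way an exact per-token match can.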

They also added a few "training wheels":

  • Top-p Sampling: They force the student to pick from the "most likely" words only, preventing them from wandering off into nonsense too quickly.
  • Masking: They ignore the "spelling" errors (tokenizer mismatches) so the teacher doesn't get confused by technical formatting issues.
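Both training wheels are easy to sketch. The helpers below are our own toy versions (the function names, the `p=0.9` threshold, and the special-token set are illustrative assumptions): a nucleus (top-p) filter that keeps only the most likely words, and a mask that drops loss terms on special/formatting tokens so tokenizer mismatches never reach the teacher's grading.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches p, then renormalize to a distribution."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for tok, pr in ranked:
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    return {tok: pr / total for tok, pr in kept.items()}

SPECIAL_TOKENS = {"<think>", "</think>"}  # hypothetical formatting tokens

def mask_special(tokens, per_token_losses):
    """Drop loss terms on special tokens so 'spelling' mismatches
    don't pollute the distillation signal."""
    return [loss for tok, loss in zip(tokens, per_token_losses)
            if tok not in SPECIAL_TOKENS]

# The low-probability tail is cut off before the student samples...
filtered = top_p_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, p=0.9)
# ...and a huge loss on a formatting token is simply ignored.
kept_losses = mask_special(["<think>", "2", "+", "2"], [9.0, 0.1, 0.2, 0.1])
```

The design point is that both mechanisms act as filters on the training signal, not on the model: they change what gets graded, leaving the student free to learn from the tokens that actually carry meaning.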

Why This Matters

Think of training an AI like teaching a child to ride a bike.

  • Old Method: You only tell them "Good!" or "Bad!" based on exactly where their foot was at that split second. They learn to wiggle their foot perfectly but fall over because they aren't balancing.
  • New Method: You look at their whole body posture and the path they are taking. You guide them to stay on the path. If they wobble, you gently steer them back to the "safe zone" of riding, rather than punishing them for one specific wobble.

The Bottom Line: By changing how the teacher gives feedback—from judging a single word to judging a small group of likely words—the AI learns more stably, makes fewer mistakes, and actually gets better at solving hard math and reasoning problems.
