Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
This paper identifies key failure modes in on-policy distillation for large language models, such as unstable gradient variance and unreliable teacher guidance. It proposes a robust fix, teacher top-K local support matching with a truncated reverse-KL and special-token masking, that achieves more stable optimization and improved downstream performance.
The Problem: The teacher's tokenizer splits the text into several small pieces (fragments such as `ing` and `>`), while the student's system sees it as one token: `thinking`.
The Result: The teacher gives the student a bad grade for the first chunk because it doesn't match the teacher's dictionary, even though the meaning is perfect. It's like a teacher failing a student for spelling "color" as "colour": both are correct, just different dialects.
The Solution: "The Safety Net" (Local Support Matching)
The authors propose a simple fix called Teacher Top-K Local Support Matching.
The Analogy: Instead of the teacher only looking at the one word the student just wrote, the teacher looks at the top 50 most likely words they could have written next.
The Teacher's List: The teacher says, "If I were writing this, I would probably pick one of these 50 words."
The Comparison: The student is compared against this whole list, not just the single word they happened to pick.
The Result:
If the student picks a weird word that isn't on the teacher's list, they get a gentle correction.
If the student picks a word that is on the list, they get a reward.
Crucially: This stops the student from gaming the system by picking random "lucky" words. It forces them to stay within the "safe zone" of what a good answer looks like, without needing to be perfect on every single step.
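The mechanism described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: `topk_reverse_kl` is a hypothetical helper that assumes raw next-token logits from both models over a shared vocabulary, truncates to the teacher's top-k tokens (the "list"), renormalizes both distributions on that support, and computes the reverse KL (student relative to teacher).

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def topk_reverse_kl(student_logits, teacher_logits, k=50, eps=1e-12):
    """Reverse KL(student || teacher) restricted to the teacher's top-k support.

    Both distributions are renormalized on that support so their
    probability mass is comparable (the support-renormalization step).
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)

    # The teacher's top-k token ids define the local support ("the list").
    support = np.argsort(p_teacher)[-k:]

    # Renormalize both truncated distributions on the shared support.
    q = p_teacher[support] / p_teacher[support].sum()
    p = p_student[support] / (p_student[support].sum() + eps)

    # Reverse KL: the student is the "policy", the teacher the reference.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

On identical logits the loss is zero; the more the student's mass (within the support) deviates from the teacher's, the larger the penalty, regardless of which single token the student happened to sample.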
They also added a few "training wheels":
Top-p Sampling: They force the student to pick from the "most likely" words only, preventing them from wandering off into nonsense too quickly.
Masking: They ignore the "spelling" errors (tokenizer mismatches) so the teacher doesn't get confused by technical formatting issues.
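Both "training wheels" are easy to state in code. A hedged sketch with helper names of my own (not the paper's): nucleus/top-p filtering keeps only the smallest set of tokens whose cumulative probability reaches p, and a loss mask zeroes out positions holding special tokens that are prone to tokenizer mismatches.

```python
import numpy as np

def top_p_filter(probs, p=0.95):
    """Zero out tokens outside the smallest set whose cumulative mass
    reaches p, then renormalize (nucleus / top-p filtering)."""
    order = np.argsort(probs)[::-1]          # tokens, most likely first
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, p) + 1    # always keep at least one token
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def loss_mask(token_ids, special_ids):
    """1.0 where the loss applies, 0.0 at special-token positions that
    tend to suffer tokenizer mismatches between teacher and student."""
    return np.array([0.0 if t in special_ids else 1.0 for t in token_ids])
```

A masked per-token loss is then averaged only over unmasked positions, e.g. `(per_token_loss * mask).sum() / mask.sum()`, so tokenizer artifacts contribute nothing to the gradient.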
Why This Matters
Think of training an AI like teaching a child to ride a bike.
Old Method: You only tell them "Good!" or "Bad!" based on exactly where their foot was at that split second. They learn to wiggle their foot perfectly but fall over because they aren't balancing.
New Method: You look at their whole body posture and the path they are taking. You guide them to stay on the path. If they wobble, you gently steer them back to the "safe zone" of riding, rather than punishing them for one specific wobble.
The Bottom Line: By changing how the teacher gives feedback—from judging a single word to judging a small group of likely words—the AI learns more stably, makes fewer mistakes, and actually gets better at solving hard math and reasoning problems.
2. Special-Token Masking: Special tokens that often suffer from tokenization mismatches are masked during the loss calculation to prevent false negatives.
3. Support Renormalization: Essential for stability, ensuring the probability mass within the truncated support is comparable between models.
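A toy numeric example of why support renormalization matters (illustrative numbers, not from the paper): the teacher's top-k support can cover very different shares of teacher and student probability mass, so both truncated vectors are renormalized to proper distributions before being compared.

```python
import numpy as np

# Toy 5-token vocabulary; the probabilities are illustrative only.
teacher = np.array([0.60, 0.25, 0.10, 0.03, 0.02])
student = np.array([0.20, 0.15, 0.10, 0.30, 0.25])

support = np.argsort(teacher)[-3:]      # teacher's top-3 token ids

# Raw mass on that support differs sharply between the two models...
teacher_mass = teacher[support].sum()   # 0.95
student_mass = student[support].sum()   # 0.45

# ...so both truncated vectors are renormalized to sum to one,
# making the within-support comparison scale-free.
q = teacher[support] / teacher_mass
p = student[support] / student_mass
```

Without this step, the student's low raw mass on the support would dominate the loss even when its relative preferences inside the support already match the teacher's.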
3. Theoretical Analysis: Bias-Variance Tradeoff
The paper provides a theoretical analysis comparing Token-Level OPD (current standard) vs. Sequence-Level Reverse-KL (ideal but expensive).
Sequence-Level: Couples each token update to future rewards (return-to-go). It is unbiased relative to the full trajectory objective, but its worst-case variance bound scales as O(T⁴) (where T is the sequence length), making it unstable over long horizons.
Token-Level: Drops the future-reward coupling. It is biased but has a much tighter worst-case variance bound of O(T²).
Proposed Method: Occupies the middle ground. It introduces a local distributional coupling (via Top-K) that recovers information discarded by single-token estimates but maintains a variance profile closer to the token-level estimator, avoiding the quartic explosion of the sequence-level estimator.
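The worst-case scaling gap can be illustrated with a deliberately adversarial toy construction of my own, not the paper's analysis: per-token signals of size O(1) that are perfectly correlated across positions. Summing T such terms gives a token-level estimator of scale O(T), i.e. variance O(T²); weighting each term by a return-to-go of size O(T) pushes the sequence-level estimator toward variance O(T⁴).

```python
import numpy as np

def worst_case_variances(T, n_samples=20000, seed=0):
    """Monte Carlo variance of two estimators driven by one shared
    noise source z (perfect cross-position correlation = worst case).

    token-level : sum of T identical O(1) terms      -> variance ~ T^2
    sequence    : each term weighted by return-to-go -> variance ~ T^4
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n_samples)
    token_est = T * z                    # T identical O(1) terms
    rtg_total = np.arange(T, 0, -1).sum()  # sum of return-to-go weights
    seq_est = rtg_total * z              # O(T)-sized weight per term
    return token_est.var(), seq_est.var()

# Empirical log-log growth exponents between T=10 and T=100:
v_tok_10, v_seq_10 = worst_case_variances(10)
v_tok_100, v_seq_100 = worst_case_variances(100)
slope_tok = np.log10(v_tok_100 / v_tok_10)   # = 2 (quadratic growth)
slope_seq = np.log10(v_seq_100 / v_seq_10)   # ~ 3.9, approaching 4 (quartic)
```

The exponent for the sequence-level estimator sits just under 4 at these lengths because the return-to-go weights sum to T(T+1)/2 rather than exactly T²/2; asymptotically the gap between the two estimators is quadratic versus quartic.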
4. Key Contributions
Theoretical Insight: Demonstrated that token-level OPD is biased but offers a significantly tighter worst-case variance bound than sequence-level objectives, explaining why it is preferred for long-horizon training despite its bias.
Empirical Diagnosis: Identified and visualized three specific failure modes of sampled-token OPD: imbalanced signals, unreliable guidance on drifted prefixes, and tokenizer-induced distortions.
Novel Algorithm: Proposed Teacher Top-K Local Support Matching, implemented as a truncated reverse-KL with top-p sampling and special-token masking.
Performance Gains: Showed that this method yields more stable optimization and better downstream performance compared to standard sampled-token OPD.
5. Experimental Results
The method was evaluated on Qwen2.5-7B-Instruct as the student model in two settings:
A. Single-Task Math Reasoning
Setup: Trained on DAPO-Math-17K using OpenThinker3-7B as the teacher.
Results:
Standard Sampled-Token OPD improved the average score from 28.2 (base) to 36.4.
Adding special-token masking to the baseline improved it to 40.7.
Proposed Method (with masking): Achieved 41.5 average score, outperforming all baselines.
Key Finding: The proposed method is less sensitive to tokenizer mismatches than the baseline, as evidenced by the smaller performance gap between masked and unmasked versions.
B. Multi-Task Agentic + Math Training
Setup: Alternating training between Math reasoning and ALFWorld (agentic tasks).
Results:
The proposed method significantly improved Math performance (e.g., Math500 from 76.0 to 82.0) while maintaining or slightly improving ALFWorld success rates (up to 97.7%).
This confirms the method's ability to handle the "brittleness" of long-horizon token supervision without degrading performance on shorter-horizon agentic tasks.
C. Training Dynamics
Stability: The proposed method exhibited smaller gradient norms, lower clipping-boundary fractions, and more consistent policy entropy compared to the baseline.
Alignment: The log-probability gap between teacher and student decreased more effectively, indicating better alignment even when using sampled-token diagnostics.
6. Significance
This work addresses a critical bottleneck in scaling LLM post-training via distillation. As models move toward longer reasoning chains and agentic behaviors, the standard "one-token" supervision becomes increasingly unreliable due to distribution shift and tokenization artifacts.
The proposed Local Support Matching offers a practical, model-agnostic "fix" that:
Stabilizes Training: Reduces the variance explosion associated with long sequences without requiring complex sequence-level reward modeling.
Improves Robustness: Mitigates the impact of tokenization mismatches and teacher drift.
Enhances Performance: Delivers consistent gains in both single-task reasoning and complex multi-task agentic scenarios.
The paper concludes that while local support matching is a significant improvement, it is a "practical design point" rather than a final solution, suggesting future work should focus on integrating these local objectives with better rollout control and uncertainty handling.