Continual Distillation of Teachers from Different Domains

This paper introduces Continual Distillation, a paradigm in which a student model learns sequentially from a stream of heterogeneous teachers without access to their training data. It proposes Self External Data Distillation (SE2D), which uses external unlabeled data to counteract the forgetting of unseen knowledge that occurs while further unseen knowledge is being transferred.

Original authors: Nicolas Michel, Maorong Wang, Jiangpeng He, Toshihiko Yamasaki

Published 2026-05-07

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to become the world's greatest expert by learning from a series of famous mentors. However, there is a catch: you can only speak with one mentor at a time, and once a mentor leaves, they vanish forever. You cannot return to ask them questions, and you have no access to the original textbooks they used to learn their craft.

This is the core problem the paper addresses, which it terms Continual Distillation.

Here is a breakdown of their idea, the problems they found, and their solution, using simple analogies.

The Setup: The Problem of the "Vanishing Mentor"

In the old days of AI, if a student model wanted to learn, it could access all the data (the textbooks) of previous teachers. But today, AI models (so-called "Foundation Models") are so huge and expensive that we cannot keep them all. We must learn from them one after another as they are released, and then lose access to the old ones.

The student model must learn from a stream of teachers:

  1. Teacher A teaches about animals.
  2. Teacher B teaches about insects.
  3. Teacher C teaches about plants.

The student must learn from A, then from B, then from C, without ever seeing A or B again.
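The access pattern described above can be sketched in a few lines. This is a toy illustration, not the paper's setup: the one-parameter student, the linear "teachers", and the names `teacher_stream`, `distill`, and `run_stream` are all hypothetical. The generator makes the constraint concrete, since a teacher that has been consumed cannot be requested again.

```python
from typing import Callable, Iterator, List

Teacher = Callable[[float], float]  # hypothetical: maps an input to the teacher's output


def teacher_stream() -> Iterator[Teacher]:
    """Yield teachers one at a time; a generator cannot be rewound."""
    yield lambda x: 1.0 * x  # stands in for Teacher A
    yield lambda x: 2.0 * x  # stands in for Teacher B
    yield lambda x: 3.0 * x  # stands in for Teacher C


def distill(w: float, teacher: Teacher, data: List[float],
            lr: float = 0.1, steps: int = 200) -> float:
    """Fit a one-parameter student y = w * x to the teacher's outputs."""
    for _ in range(steps):
        # Gradient of the mean squared error between student and teacher.
        grad = sum(2 * (w * x - teacher(x)) * x for x in data) / len(data)
        w -= lr * grad
    return w


def run_stream(data: List[float]) -> List[float]:
    """Distill from each teacher in order and record the student after each."""
    w = 0.0
    history = []
    for teacher in teacher_stream():
        w = distill(w, teacher, data)
        history.append(w)  # the teacher is gone after this iteration
    return history


history = run_stream([0.5, 1.0, 1.5])
print(history)
```

In this toy the student ends up close to each teacher's slope in turn, so after the last step it only reflects Teacher C. A single parameter cannot hold three teachers at once, so the sketch exaggerates the effect, but it shows why sequential access without replay invites forgetting.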

The Two Major Challenges

1. The Problem of the "Blind Spot" (Transferring Unseen Knowledge)
The teachers know things the student has never seen. For example, Teacher A might be an expert in "marine animals," but the student has only seen pictures of "land animals."

  • The Paper's Discovery: If the student practices on a random set of images that neither the student nor the teacher has seen before (let's call this "External Data"), it picks up more than you might expect. When the teacher looks at these random images, its predictions reveal where it is confident and where it is uncertain. By imitating how the teacher reacts to these unknown images, the student can actually learn about the domain of "marine animals," even though the student never directly saw a marine animal.
  • The Metaphor: Imagine a master chef (the teacher) tasting a foreign, unknown fruit. Even if the student has never seen this fruit, the chef's reaction (e.g., "This tastes like a mix of lemon and honey") teaches the student the flavor profile of that fruit. This is called Unseen Knowledge Transfer (UKT).
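One common way to "observe how the teacher reacts" is to match the teacher's softened output probabilities on each external image. The sketch below uses the standard temperature-scaled KL divergence from the knowledge distillation literature; this summary does not state the paper's exact loss, so treat the loss form, the temperature, and the function names as assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """How far the student's reaction is from the teacher's on one input."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl_divergence(p, q)


# Made-up logits for a single external image (illustrative values only).
teacher_logits = [2.0, 1.0, 0.1]
loss_before = distillation_loss(teacher_logits, [0.0, 0.0, 0.0])
loss_after = distillation_loss(teacher_logits, [1.8, 0.9, 0.2])
print(loss_before, loss_after)
```

The loss shrinks as the student's logits approach the teacher's. No label for the external image is needed, which is why unlabeled data is enough for this kind of transfer.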

2. The Problem of "Amnesia" (Forgetting Unseen Knowledge)
Here comes the bad news. When the student proceeds to learn from Teacher B (insects), they begin to forget what Teacher A taught them about marine animals.

  • The Paper's Discovery: Since the student never directly saw the marine animals, this knowledge is fragile. As soon as new information arrives, this old "ghost knowledge" disappears.
  • The Metaphor: It is like learning a new language. If you learned French from a book but never spoke it, and then immediately started studying German, you might forget the French words you only learned "by reading." This is called Unseen Knowledge Forgetting (UKF).

The Solution: "Self External Data Distillation" (SE2D)

The authors realized that standard methods try to memorize the teacher's answers but fail to safely preserve the "ghost knowledge." They proposed a new trick called SE2D.

How it works:
Every time the student finishes learning from a teacher, they take a "snapshot" (a checkpoint) of their brain.

  • Normally, when learning from the next teacher, the student would only try to match that new teacher.
  • The SE2D Twist: When the student practices on the "External Data" (the random images no one knew), they also compare their answers with those of their own previous snapshot and try to stay close to them.
  • The Metaphor: Imagine you are a student. Before starting your new German course, you take a moment to review your old French notes specifically while looking at a random, foreign fruit. You ask yourself: "Based on my old notes, how would I describe this fruit?" This forces your brain to keep the French knowledge alive while you are busy learning German.

By doing this, the student stabilizes the "ghost knowledge" of previous teachers without needing to see the original teachers again.
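The snapshot idea can be sketched as a two-term loss on each external sample: one term pulls the student toward the current teacher, the other toward its own frozen checkpoint. This is a minimal sketch under assumptions, not the authors' implementation: the weighted-sum form, the `self_weight` parameter, the temperature, and the lookup-table "student" are all hypothetical.

```python
import copy
import math
from typing import List


def soft_targets(logits: List[float], temperature: float = 2.0) -> List[float]:
    """Temperature-scaled softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def se2d_style_loss(teacher_logits: List[float],
                    snapshot_logits: List[float],
                    student_logits: List[float],
                    self_weight: float = 0.5) -> float:
    """Blend distillation from the new teacher with self-distillation
    from the student's previous snapshot, on one external sample."""
    q = soft_targets(student_logits)
    from_teacher = kl(soft_targets(teacher_logits), q)
    from_snapshot = kl(soft_targets(snapshot_logits), q)
    return (1.0 - self_weight) * from_teacher + self_weight * from_snapshot


# Toy student: a table from external sample id to logits (illustrative only).
student = {"ext_0": [1.5, 0.2, -0.3]}
snapshot = copy.deepcopy(student)   # frozen after finishing the previous teacher
student["ext_0"] = [0.1, 1.9, 0.0]  # the student has drifted toward the new teacher
new_teacher = {"ext_0": [0.0, 2.0, 0.1]}

loss_teacher_only = se2d_style_loss(
    new_teacher["ext_0"], snapshot["ext_0"], student["ext_0"], self_weight=0.0)
loss_with_self = se2d_style_loss(
    new_teacher["ext_0"], snapshot["ext_0"], student["ext_0"], self_weight=0.5)
print(loss_teacher_only, loss_with_self)
```

With `self_weight=0.0` the drift away from the snapshot goes unpenalized, so the loss is small even though the student's old reaction to this sample has been overwritten. Raising the weight makes that drift visible to the optimizer, at the cost of slowing adaptation to the new teacher.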

What They Found (The Results)

  1. The right kind of "randomness" is crucial: The "External Data" (the random images) must be related to what the teachers know to some degree.
    • If the teachers know about animals and the random images are of other animals, the student learns a lot.
    • If the random images are of trucks (completely unrelated), the student gets confused and forgets even more.
  2. The Trade-off: There is a balance. If you focus too much on the new teacher, you forget the old one. If you focus too much on the old one, you don't learn the new one. SE2D helps find the "Goldilocks" zone where the student retains old knowledge while learning the new.
  3. It works: In various tests (such as recognizing different cat breeds or digits), their method helped the student retain more about the "vanished" teachers than other standard methods.

The Conclusion

The paper introduces a new setting and method for how AI can learn from a stream of teachers who disappear after use. They found that using "random" external data helps the student learn things they have never seen, but that this knowledge is also forgotten quickly once the next teacher arrives. Their solution, SE2D, is like a memory exercise that forces the student to review their past lessons on this random data, ensuring they do not lose the valuable insights from teachers they can no longer reach.

Important Note: The authors warn that this "Unseen Knowledge Transfer" is a double-edged sword. If the random data is poor or biased, the student might accidentally learn bad habits or biases from the teacher without ever noticing. They suggest this needs further investigation, but they do not claim to have solved this specific risk yet.
