TED: Training-Free Experience Distillation for Multimodal Reasoning

The Big Idea: Learning Without "Rewiring" the Brain

Imagine you have a smart student (the Student Model) who is trying to solve complex puzzles. Usually, to make this student smarter, you have to force them to study hard, rewire their brain, and memorize thousands of new facts. When those lessons come from a stronger teacher, the AI world calls this Knowledge Distillation. It works well, but it's expensive, takes a lot of time, and requires a massive library of textbooks (training data).

TED (Training-Free Experience Distillation) asks a different question: What if the student could get smarter just by reading a helpful cheat sheet, without ever changing their brain?

Instead of rewiring the student's brain, TED gives them a living, breathing "Cheat Sheet" (called Contextual Experience) that gets updated every time they try a problem.

How It Works: The Three-Act Play

Think of the TED process like a Master Chef (Teacher) training a Junior Chef (Student) in a kitchen.

Act 1: The Cooking Contest (Trajectory Generation)

The Junior Chef is given a recipe (a problem) and asked to cook it five different ways at the same time. Maybe one way is too salty, one is burnt, and one is perfect.

The Teacher also cooks the dish, but they only cook it once, and they make sure it's perfect.
The Goal: We now have a bunch of student attempts and one perfect teacher attempt.

Act 2: The Critique Session (Experience Generation)

The Master Chef looks at all the dishes. Instead of just saying "Good job" or "Bad job," the Chef writes down general rules based on what happened.

Example: "When the sauce turns brown too fast, lower the heat immediately."
Example: "Don't add salt until the very end."
These aren't just notes about this specific dish; they are universal cooking tips that apply to any dish.
The Chef then adds these tips to the Cheat Sheet (the Context) that the Junior Chef reads before cooking the next meal.

Act 3: The Cleanup Crew (Experience Compression)

Here is the tricky part. If you keep adding tips to the Cheat Sheet forever, it will become a 1,000-page book that is impossible to read. The Junior Chef will get overwhelmed and forget the important stuff.

TED's Solution: The Master Chef acts as an editor. They look at the Cheat Sheet and ask: "Which tips do we use the most?"
If two tips say the same thing, they merge them into one super-tip.
If a tip is outdated or wrong, they delete it.
This keeps the Cheat Sheet short, punchy, and full of only the most useful advice.

Why Is This a Big Deal?

1. It's "Training-Free" (No Brain Surgery)

Traditional AI learning is like performing brain surgery on the student to install new knowledge. It's heavy, risky, and expensive.
TED is like giving the student a smart notebook. The student doesn't change; they just get better instructions. This means you can use this on cheap computers, on phones, or even with "black box" AI models that you can't touch or change.

2. It's a "Low-Data" Superpower

Usually, to teach an AI, you need a very large number of examples. In the paper's experiments, TED works with just 100.

Analogy: Imagine learning to drive. Traditional methods require you to drive 10,000 miles to learn the rules. TED is like having a driving instructor who watches you drive 100 miles, writes down the exact mistakes you made, and hands you a laminated card with the rules. You can then drive much better without needing those 10,000 miles.

3. It Saves a Fortune

The paper shows that TED is about 23 times cheaper (22.9×) than traditional methods.

Traditional: Costs about $288 to train (like hiring a full-time tutor for a month).
TED: Costs about $12.60 (like buying a few good books).

The Results: Does It Work?

The researchers tested this on hard math and logic puzzles (like visual puzzles and complex equations).

Before TED: On the MathVision benchmark, the student got about 63% of the answers right.
After TED: The student jumped to about 70% right.
The Comparison: This is almost as good as the expensive, full-training method, but it cost a fraction of the price and took a tiny fraction of the time.

The Takeaway

TED proves that you don't always need to "reprogram" an AI to make it smarter. Sometimes, you just need to give it a better, constantly updated set of instructions based on its past mistakes and the teacher's wisdom.

It's the difference between trying to memorize the entire dictionary (traditional training) versus carrying a perfectly curated pocket guide (TED) that tells you exactly what to do when you get stuck.

1. Problem Statement

Traditional Knowledge Distillation (KD) transfers capabilities from a large "teacher" model to a smaller "student" model by updating the student's parameters via gradient-based optimization (e.g., fine-tuning on soft labels or reasoning trajectories). While effective, this approach has significant limitations:

High Resource Cost: It requires substantial computational power and large-scale training data.
Inflexibility: It is impractical for resource-constrained environments (e.g., edge devices) or black-box APIs where model weights cannot be updated.
Data Dependency: Performance often scales with the volume of training data, making it inefficient for low-data scenarios.

The paper asks: Can knowledge distillation be achieved without updating model parameters?

2. Methodology: The TED Framework

The authors propose TED (Training-free Experience Distillation), a framework that shifts the distillation target from model parameters to in-context experience. Instead of learning weights, the student model learns by accumulating and reusing abstract reasoning principles injected into its system prompt.
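To make "injected into its system prompt" concrete, here is a minimal sketch of how an experience set might be rendered into a prompt. The function name `build_system_prompt`, the section heading text, and the numbered-list layout are all illustrative assumptions; the summary above does not specify the paper's actual prompt format.

```python
from typing import List


def build_system_prompt(base_prompt: str, experiences: List[str]) -> str:
    """Append the current experience set to a base system prompt.

    Hypothetical layout: a numbered list under a fixed heading.
    """
    if not experiences:
        # With no accumulated experience, the student sees the plain prompt.
        return base_prompt
    lines = [f"{i}. {tip}" for i, tip in enumerate(experiences, start=1)]
    return base_prompt + "\n\nExperience from earlier problems:\n" + "\n".join(lines)


prompt = build_system_prompt(
    "You are a careful math solver.",
    ["Check units before giving the final answer.", "Re-read the figure labels."],
)
print(prompt)
```

The point of the sketch is that the student's weights are never involved: only the text it is given changes between iterations.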

The framework operates through three iterative stages:

A. Reasoning Trajectory Generation

Parallel Sampling: For a given input $x$, the student model generates $N$ reasoning trajectories ($\tau_i$) in parallel.
Teacher Generation: The teacher model independently generates its own reasoning trajectory ($\tau_T$).
Compression & Filtering: Raw trajectories are condensed to remove verbosity. A teacher trajectory is used only if it arrives at the correct ground-truth answer ($y$).
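Stage A can be sketched roughly as follows. Everything here is an assumption made for illustration: the models are stand-in callables rather than real LLM calls, the "Answer:" suffix is an invented output convention, sampling is sequential rather than parallel, and the trajectory-condensing step is omitted.

```python
from itertools import cycle
from typing import Callable, List, Optional

# Hypothetical model interface: problem text in, "reasoning ... Answer: X" out.
Model = Callable[[str], str]


def extract_answer(trajectory: str) -> str:
    """Pull the final answer out of a trajectory (assumed 'Answer:' convention)."""
    return trajectory.rsplit("Answer:", 1)[-1].strip()


def sample_student(student: Model, x: str, n: int) -> List[str]:
    """Draw n student trajectories for input x (sequential here for simplicity)."""
    return [student(x) for _ in range(n)]


def teacher_trajectory(teacher: Model, x: str, y: str) -> Optional[str]:
    """Return the teacher trajectory only if it reaches the ground truth y."""
    tau = teacher(x)
    return tau if extract_answer(tau) == y else None


# Toy stand-ins for real models: the student alternates right and wrong answers.
_student_answers = cycle(["4", "5", "4"])


def toy_student(x: str) -> str:
    return f"I add the numbers. Answer: {next(_student_answers)}"


def toy_teacher(x: str) -> str:
    return "2 + 2 = 4. Answer: 4"


student_taus = sample_student(toy_student, "2 + 2 = ?", n=3)
teacher_tau = teacher_trajectory(toy_teacher, "2 + 2 = ?", y="4")
correct = [extract_answer(t) == "4" for t in student_taus]
print(correct)  # [True, False, True]
```

Labeling each student trajectory as correct or incorrect at this point is what later lets the critique stage work from a mix of positive and negative attempts.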

B. Experience Generation (Teacher-Guided Critique)

Critique Mechanism: The teacher analyzes the student's trajectories against its own correct trajectory and the ground truth.
Abstraction: Instead of storing raw examples, the teacher extracts generalized experiences: reusable reasoning tips, common failure patterns, and correction strategies.
Update Actions: The teacher performs discrete actions on the experience set $E$:
- Add: Insert a new distilled principle.
- Modify: Refine an existing principle for better generality.
- Delete: Remove obsolete or harmful rules.
- None: Keep the set unchanged.
Balance: The system ensures a balance of positive (correct) and negative (incorrect) student trajectories to guide the critique effectively.
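The four update actions can be sketched as a small state-update function. This is a hedged illustration only: the `Update` record, the integer item ids, and the dictionary representation of $E$ are assumptions, and in TED the action itself would be chosen by the teacher model rather than hard-coded as in the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Update:
    """One teacher decision (hypothetical structure)."""
    action: str                   # "add", "modify", "delete", or "none"
    item_id: Optional[int] = None
    text: Optional[str] = None


def apply_update(experiences: Dict[int, str], update: Update) -> Dict[int, str]:
    """Return a new experience set with the update applied."""
    result = dict(experiences)
    if update.action in ("add", "modify") and not update.text:
        raise ValueError(f"'{update.action}' needs non-empty text")
    if update.action == "add":
        new_id = max(result, default=0) + 1
        result[new_id] = update.text
    elif update.action == "modify":
        if update.item_id not in result:
            raise KeyError(f"no experience with id {update.item_id}")
        result[update.item_id] = update.text
    elif update.action == "delete":
        result.pop(update.item_id, None)
    elif update.action != "none":
        raise ValueError(f"unknown action: {update.action}")
    return result


E: Dict[int, str] = {}
E = apply_update(E, Update("add", text="Verify each step against the figure."))
E = apply_update(E, Update("add", text="Check the sign when moving terms."))
E = apply_update(E, Update("modify", item_id=1,
                           text="Verify each step against the given diagram or table."))
E = apply_update(E, Update("delete", item_id=2))
E = apply_update(E, Update("none"))
print(E)  # {1: 'Verify each step against the given diagram or table.'}
```

Returning a fresh dictionary rather than mutating in place keeps each iteration's experience set easy to compare against the previous one.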

C. Experience Compression

To prevent context explosion and noise accumulation as the experience set grows:

Utility Tracking: TED tracks the usage frequency of each experience item across training samples.
Teacher-Guided Selection: When the context budget is exceeded, the teacher compresses the set by:
- Merging: Combining redundant items into a higher-level concept.
- Rewriting: Rephrasing for clarity and applicability.
- Deleting: Removing low-utility or noisy items.
Result: A compact, high-utility "persistent context" that evolves over time without ever touching model weights.
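The compression stage could be approximated as below. This is a simplified stand-in, not the paper's procedure: the budget is counted in items (the actual budget unit is not given here), merging is reduced to collapsing items with identical normalized text, rewriting is skipped, and a usage ranking replaces the teacher's judgment about what to delete.

```python
from collections import Counter
from typing import Dict, List


def record_usage(usage: Counter, used_ids: List[int]) -> None:
    """Count which experience items were used on one training sample."""
    usage.update(used_ids)


def compress(experiences: Dict[int, str], usage: Counter, budget: int) -> Dict[int, str]:
    """Shrink the experience set to at most `budget` items (assumed unit: items)."""
    if len(experiences) <= budget:
        return dict(experiences)

    # Stand-in for merging: collapse items whose normalized text is identical,
    # keeping the first id seen and pooling the usage counts.
    survivor_for: Dict[str, int] = {}
    pooled = Counter()
    for item_id, text in experiences.items():
        key = " ".join(text.lower().split())
        keep_id = survivor_for.setdefault(key, item_id)
        pooled[keep_id] += usage[item_id]

    # Stand-in for deletion: keep the most-used survivors, ties broken by id.
    ranked = sorted(survivor_for.values(), key=lambda i: (-pooled[i], i))
    return {i: experiences[i] for i in sorted(ranked[:budget])}


E = {
    1: "Check units.",
    2: "check  units.",
    3: "Label the diagram.",
    4: "Guess and move on.",
}
usage = Counter()
record_usage(usage, [1, 3])
record_usage(usage, [2, 3])
record_usage(usage, [3])

print(compress(E, usage, budget=2))  # {1: 'Check units.', 3: 'Label the diagram.'}
```

In the example, items 1 and 2 collapse into one, and the never-used item 4 is dropped once the set exceeds the budget.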

3. Key Contributions

Parameter-Free Distillation: TED introduces a novel paradigm where knowledge transfer occurs entirely through contextual experience injection rather than gradient-based parameter updates.
Teacher-Guided Compression: A mechanism that actively manages the experience pool by tracking utility and performing intelligent merging/rewriting, solving the problem of unbounded context growth in in-context learning.
Cross-Modal Generalization: The framework demonstrates that distilled reasoning experiences (e.g., logic patterns) can transfer effectively across different modalities (multimodal to text-only) and model scales.

4. Experimental Results

The authors evaluated TED on MathVision (multimodal math), VisualPuzzles (visual logic), and AIME25 (text-only math), using Qwen3-VL as the student (Qwen3-8B for the text-only AIME25) and Kimi-K2.5 as the teacher.

Performance Gains:
- MathVision: Improved Qwen3-VL-8B accuracy from 0.627 (Direct Inference) to 0.702 using only 100 training samples.
- VisualPuzzles: Improved Qwen3-VL-8B from 0.517 to 0.561.
- AIME25: Improved Qwen3-8B from 0.673 to 0.733.
Comparison with Baselines:
- TED outperformed other training-free methods like Reflexion, Memento, and MemCom.
- While fully trained Naive-KD (parameter-based) achieved the highest absolute scores, TED achieved competitive performance with significantly fewer resources.
Cost Efficiency:
- TED reduced training costs by 22.9× compared to Naive-KD.
- Cost Breakdown: Naive-KD required ~576 GPU hours (~$288), whereas TED required only ~8 hours of inference/processing (~$12.6) with no gradient updates.
Ablation Studies:
- Compression is Critical: Removing the compression mechanism caused MathVision accuracy to drop from 0.702 to 0.594, below the 0.627 direct-inference baseline, due to accumulated noise.
- Teacher Quality: Stronger teachers yielded better distilled experiences.
- Data Efficiency: TED showed diminishing returns with more data (saturating quickly), whereas parameter-based KD continued to improve, highlighting TED's strength in low-data regimes.
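As a quick sanity check on the figures reported above, the cost ratio and the absolute accuracy gains can be recomputed directly. The numbers are copied from this section; the variable names are just for illustration.

```python
# Reported approximate costs in US dollars.
naive_kd_cost = 288.0
ted_cost = 12.6

cost_ratio = naive_kd_cost / ted_cost
print(f"cost ratio: {cost_ratio:.1f}x")  # cost ratio: 22.9x

# (direct inference, with TED) accuracy pairs as reported.
results = {
    "MathVision": (0.627, 0.702),
    "VisualPuzzles": (0.517, 0.561),
    "AIME25": (0.673, 0.733),
}

gains = {name: after - before for name, (before, after) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.3f}")
```

The ratio of the two dollar figures reproduces the stated 22.9× reduction, and the gains work out to 7.5, 4.4, and 6.0 percentage points respectively.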

5. Significance

Practicality for Edge/Black-Box: TED provides a viable solution for improving model performance in environments where retraining is impossible (e.g., proprietary APIs) or too expensive (edge devices).
Data Efficiency: It shows that meaningful knowledge transfer can occur with as few as 100 samples, making it well suited to specialized or rare domains where large datasets are unavailable.
Paradigm Shift: The paper challenges the necessity of parameter updates for distillation, suggesting that contextual accumulation of abstract reasoning principles is a powerful, lightweight alternative to traditional fine-tuning.

In conclusion, TED demonstrates that by treating "experience" as a compressible, reusable resource rather than a static cache of examples, models can achieve significant reasoning improvements without the computational burden of traditional training.