A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling

This paper introduces BoT, a unified, size-agnostic framework that treats model weights as continuous signals. Using Discrete Wavelet Transforms, it enables efficient bidirectional knowledge transfer between models of different sizes, saving substantial pre-training FLOPs while achieving state-of-the-art performance.

Jianlu Shen, Fu Feng, Jiaze Xu, Yucheng Xie, Jiaqi Lv, Xin Geng

Published 2026-03-10

The Big Problem: The "Model Zoo" Dilemma

Imagine a massive library of AI models, called a "Model Zoo." Inside, you have tiny, fast models (like a Smartphone) and huge, powerful models (like a Supercomputer).

Usually, if you want to use the knowledge from the Supercomputer to help the Smartphone, or vice versa, you hit a wall.

  • The Smartphone problem: You can't just copy the Supercomputer's brain into the Smartphone; it's too big and won't fit.
  • The Supercomputer problem: You can't just copy the Smartphone's brain into the Supercomputer; it's too small and leaves empty space.

Currently, scientists have to use two completely different, messy tools to fix this:

  1. For Small-to-Large: They try to "stretch" the small brain, often by copying neurons or using complex math to guess how to fill the gaps. It's like trying to stretch a small rubber band to fit a giant balloon—it often snaps or looks weird.
  2. For Large-to-Small: They try to "chop" the big brain, picking random pieces to keep. It's like trying to make a salad by randomly grabbing leaves from a giant tree; you might lose the important nutrients.

The Solution: BoT (Bidirectional Knowledge Transfer)

The authors propose a new method called BoT. Their big idea is to stop thinking of AI models as rigid blocks of code and start thinking of them as continuous signals, like a radio wave or a music track.

The Core Analogy: The "Learngene" as a Music Track

Imagine the knowledge inside a trained AI model is a song.

  • The "Learngene" is the core melody of that song.

  • The Model Size is just the resolution or quality of the audio file.

  • A Small Model (Low Resolution): It's like a low-quality MP3 or a blurry thumbnail image. It captures the main melody (the low-frequency notes) but misses the tiny details.

  • A Large Model (High Resolution): It's like a high-fidelity FLAC file or a 4K image. It has the main melody plus all the high-frequency details and nuances.

The Magic Trick:
The authors realized that moving between these sizes isn't about chopping or stretching. It's about changing the volume and frequency, just like a sound engineer does with music.

How BoT Works: The Wavelet Wave

They use a mathematical tool called the Discrete Wavelet Transform (DWT). Think of this as a Magic Zoom Lens or a Sound Mixer.

1. Large-to-Small (The "Downsampling" or "Compression")

  • Goal: Take the Supercomputer's brain and shrink it for the Smartphone without losing the main ideas.
  • How BoT does it: It uses the DWT to filter out the "high-frequency noise" (the tiny, specific details) and keeps only the "low-frequency core" (the main melody).
  • The Result: You get a compact, efficient version of the knowledge that fits perfectly into the smaller model. It's like taking a 4K movie and compressing it into a clear, standard-definition video that still tells the whole story.
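The filtering step above can be sketched with a one-level Haar wavelet in plain NumPy. This is a minimal illustration, not the paper's exact recipe: the choice of the Haar filter, the row-wise treatment of weights, and the toy numbers are all assumptions.

```python
import numpy as np

def haar_dwt_1d(x):
    """One-level Haar DWT: split a signal into a low-frequency
    (approximation) half and a high-frequency (detail) half."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency "melody"
    detail = (even - odd) / np.sqrt(2)   # high-frequency "nuance"
    return approx, detail

# Treat a row of large-model weights as a signal and keep only
# the low-frequency core: a half-sized weight vector.
big_weights = np.array([0.9, 1.1, -0.4, -0.6, 0.2, 0.0, 0.5, 0.7])
small_weights, _ = haar_dwt_1d(big_weights)
print(small_weights.shape)  # prints (4,) -- half the original size
```

Discarding the detail coefficients is exactly the "compression" move: the approximation half averages neighboring weights, preserving the coarse structure while dropping fine-grained fluctuations.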

2. Small-to-Large (The "Upsampling" or "Expansion")

  • Goal: Take the Smartphone's brain and expand it to fit the Supercomputer.
  • How BoT does it: It takes the small model's "main melody" (the low-frequency core) and uses the Inverse DWT to reconstruct the full song.
  • The Secret Sauce: What about the missing high-frequency details? BoT simply fills them with silence (zeros).
  • Why this works: Instead of guessing what the details should be (which introduces errors), it leaves them blank. The Supercomputer then learns to fill in those blanks naturally during training, starting from a perfect, stable foundation. It's like taking a sketch and handing it to a master painter; the painter knows exactly how to add the details because the outline is perfect.
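The expansion direction can be sketched the same way: run the inverse Haar transform with the missing detail coefficients set to zero. Again, the Haar filter and toy values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def haar_idwt_1d(approx, detail):
    """Inverse one-level Haar DWT: recombine approximation and
    detail coefficients into a signal of twice the length."""
    approx = np.asarray(approx, dtype=float)
    detail = np.asarray(detail, dtype=float)
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)  # even samples
    x[1::2] = (approx - detail) / np.sqrt(2)  # odd samples
    return x

small_weights = np.array([1.2, -0.5, 0.3, 0.8])
# The missing high-frequency details are filled with "silence"
# (zeros), so the large model starts from a smooth foundation
# rather than guessed, possibly wrong, detail values.
zeros = np.zeros_like(small_weights)
big_weights = haar_idwt_1d(small_weights, zeros)
print(big_weights.shape)  # prints (8,) -- double the original size
```

With zero details, each small-model weight simply spreads across a pair of neighboring positions in the large model; the high-frequency structure is then learned during training rather than fabricated up front.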

Why This is a Game-Changer

  1. One Tool for Both Jobs: Before, you needed a "chopper" for shrinking and a "stretcher" for growing. BoT is a single, unified tool that does both perfectly.
  2. No Extra Training Needed: The method is "parameter-free." You don't need to train a separate AI to figure out how to transfer the knowledge. It happens instantly using math.
  3. Huge Savings: Because the models start with such a good "head start," they don't need to be trained as long.
    • Small-to-Large: Saves up to 67% of the computing power (FLOPs).
    • Large-to-Small: Saves up to 52% of the computing power.

The Real-World Impact

The authors tested this on famous AI models (like BERT for language and DeiT for images).

  • Result: The models learned faster, used less electricity, and performed better on tests (like answering questions or identifying flowers) than models built from scratch or using old methods.
  • Visual Proof: When they looked at what the models were paying attention to, the BoT models focused on the actual object (like a bird's beak) immediately, whereas randomly initialized models were looking at the background.

Summary in One Sentence

BoT treats AI knowledge like a universal language of sound waves, allowing us to instantly shrink a giant brain or expand a tiny one by simply filtering out or adding silence, saving massive amounts of time and energy in the process.