A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling

This paper introduces BoT, a unified, size-agnostic framework that treats model weights as continuous signals. Using Discrete Wavelet Transforms, it enables efficient bidirectional knowledge transfer between models of different sizes, saving substantial pre-training FLOPs while achieving state-of-the-art performance.

Jianlu Shen, Fu Feng, Jiaze Xu, Yucheng Xie, Jiaqi Lv, Xin Geng

Published 2026-03-10

The Big Problem: The "Model Zoo" Dilemma

Imagine a massive library of AI models, called a "Model Zoo." Inside, you have tiny, fast models (like a Smartphone) and huge, powerful models (like a Supercomputer).

Usually, if you want to use the knowledge from the Supercomputer to help the Smartphone, or vice versa, you hit a wall.

  • The Smartphone problem: You can't just copy the Supercomputer's brain into the Smartphone; it's too big and won't fit.
  • The Supercomputer problem: You can't just copy the Smartphone's brain into the Supercomputer; it's too small and leaves empty space.

Currently, scientists have to use two completely different, messy tools to fix this:

  1. For Small-to-Large: They try to "stretch" the small brain, often by copying neurons or using complex math to guess how to fill the gaps. It's like trying to stretch a small rubber band to fit a giant balloon—it often snaps or looks weird.
  2. For Large-to-Small: They try to "chop" the big brain, picking random pieces to keep. It's like trying to make a salad by randomly grabbing leaves from a giant tree; you might lose the important nutrients.

The Solution: BoT (Bidirectional Knowledge Transfer)

The authors propose a new method called BoT. Their big idea is to stop thinking of AI models as rigid blocks of code and start thinking of them as continuous signals, like a radio wave or a music track.

The Core Analogy: The "Learngene" as a Music Track

Imagine the knowledge inside a trained AI model is a song.

  • The "Learngene" is the core melody of that song.

  • The Model Size is just the resolution or quality of the audio file.

  • A Small Model (Low Resolution): It's like a low-quality MP3 or a blurry thumbnail image. It captures the main melody (the low-frequency notes) but misses the tiny details.

  • A Large Model (High Resolution): It's like a high-fidelity FLAC file or a 4K image. It has the main melody plus all the high-frequency details and nuances.

The Magic Trick:
The authors realized that moving between these sizes isn't about chopping or stretching. It's about changing the volume and frequency, just like a sound engineer does with music.

How BoT Works: The Wavelet Wave

They use a mathematical tool called the Discrete Wavelet Transform (DWT). Think of this as a Magic Zoom Lens or a Sound Mixer.

1. Large-to-Small (The "Downsampling" or "Compression")

  • Goal: Take the Supercomputer's brain and shrink it for the Smartphone without losing the main ideas.
  • How BoT does it: It uses the DWT to filter out the "high-frequency noise" (the tiny, specific details) and keeps only the "low-frequency core" (the main melody).
  • The Result: You get a compact, efficient version of the knowledge that fits perfectly into the smaller model. It's like taking a 4K movie and compressing it into a clear, standard-definition video that still tells the whole story.
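The filtering step above can be sketched with a one-level Haar wavelet in plain NumPy. This is a minimal illustration, not the paper's exact recipe: the choice of the Haar filter, the row-wise treatment of weights, and the toy numbers are all assumptions.

```python
import numpy as np

def haar_dwt_1d(x):
    """One-level Haar DWT: split a signal into a low-frequency
    (approximation) half and a high-frequency (detail) half."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency "melody"
    detail = (even - odd) / np.sqrt(2)   # high-frequency "nuance"
    return approx, detail

# Treat a row of large-model weights as a signal and keep only
# the low-frequency core: a half-sized weight vector.
big_weights = np.array([0.9, 1.1, -0.4, -0.6, 0.2, 0.0, 0.5, 0.7])
small_weights, _ = haar_dwt_1d(big_weights)
print(small_weights.shape)  # prints (4,) -- half the original size
```

Discarding the detail coefficients is exactly the "compression" move: the approximation half averages neighboring weights, preserving the coarse structure while dropping fine-grained fluctuations.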

2. Small-to-Large (The "Upsampling" or "Expansion")

  • Goal: Take the Smartphone's brain and expand it to fit the Supercomputer.
  • How BoT does it: It takes the small model's "main melody" (the low-frequency core) and uses the Inverse DWT to reconstruct the full song.
  • The Secret Sauce: What about the missing high-frequency details? BoT simply fills them with silence (zeros).
  • Why this works: Instead of guessing what the details should be (which introduces errors), it leaves them blank. The Supercomputer then learns to fill in those blanks naturally during training, starting from a perfect, stable foundation. It's like taking a sketch and handing it to a master painter; the painter knows exactly how to add the details because the outline is perfect.
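The expansion direction can be sketched the same way: run the inverse Haar transform with the missing detail coefficients set to zero. Again, the Haar filter and toy values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def haar_idwt_1d(approx, detail):
    """Inverse one-level Haar DWT: recombine approximation and
    detail coefficients into a signal of twice the length."""
    approx = np.asarray(approx, dtype=float)
    detail = np.asarray(detail, dtype=float)
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)  # even samples
    x[1::2] = (approx - detail) / np.sqrt(2)  # odd samples
    return x

small_weights = np.array([1.2, -0.5, 0.3, 0.8])
# The missing high-frequency details are filled with "silence"
# (zeros), so the large model starts from a smooth foundation
# rather than guessed, possibly wrong, detail values.
zeros = np.zeros_like(small_weights)
big_weights = haar_idwt_1d(small_weights, zeros)
print(big_weights.shape)  # prints (8,) -- double the original size
```

With zero details, each small-model weight simply spreads across a pair of neighboring positions in the large model; the high-frequency structure is then learned during training rather than fabricated up front.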

Why This is a Game-Changer

  1. One Tool for Both Jobs: Before, you needed a "chopper" for shrinking and a "stretcher" for growing. BoT is a single, unified tool that does both perfectly.
  2. No Extra Training Needed: The method is "parameter-free." You don't need to train a separate AI to figure out how to transfer the knowledge. It happens instantly using math.
  3. Huge Savings: Because the models start with such a good "head start," they don't need to be trained as long.
    • Small-to-Large: Saves up to 67% of the computing power (FLOPs).
    • Large-to-Small: Saves up to 52% of the computing power.

The Real-World Impact

The authors tested this on famous AI models (like BERT for language and DeiT for images).

  • Result: The models learned faster, used less electricity, and performed better on tests (like answering questions or identifying flowers) than models built from scratch or using old methods.
  • Visual Proof: When they looked at what the models were paying attention to, the BoT models focused on the actual object (like a bird's beak) immediately, whereas randomly initialized models were looking at the background.

Summary in One Sentence

BoT treats AI knowledge like a universal language of sound waves, allowing us to instantly shrink a giant brain or expand a tiny one by simply filtering out or adding silence, saving massive amounts of time and energy in the process.