One-for-All Model Initialization with Frequency-Domain Knowledge

This paper introduces FRONT, a training-free framework that extracts a model's task-agnostic "learngene" from the low-frequency components of its weights using Discrete Cosine Transform, enabling efficient initialization of downstream models of arbitrary scales while significantly accelerating convergence and reducing training costs.

Jianlu Shen, Fu Feng, Yucheng Xie, Jiaqi Lv, Xin Geng

Published 2026-03-10

The Big Problem: The "One-Size-Fits-None" Dilemma

Imagine you have a master chef who has spent 10 years perfecting a complex recipe for a giant, 50-course banquet (a massive, pre-trained AI model). This chef knows everything about cooking: how to chop, how to season, how to balance flavors.

Now, imagine you want to open a small food truck (a smaller AI model) or a massive catering hall (a larger AI model).

  • The Old Way: You try to copy the master chef's entire 50-course menu onto your small food truck. It doesn't fit! Or, you try to shrink the giant banquet down to a single sandwich, but you lose all the flavor.
  • The Current "Smart" Ways:
    • Cut-and-Paste: You try to grab just the "chopping" section of the chef's notes and hope that's enough. But cooking is interconnected; chopping without knowing the seasoning ruins the dish.
    • The "Magic Generator": You hire a robot to study the chef's notes and guess what the food truck's menu should look like. But this robot needs to study thousands of other chefs first, takes forever to learn, and often gets it wrong.

The Result: Starting a new AI model from scratch is slow and expensive. Trying to adapt a big model to a small one (or vice versa) is messy and usually fails.


The Solution: The "Learngene" (The DNA of Cooking)

The authors of this paper discovered something fascinating. They realized that the chef's true knowledge isn't in the specific details of the 50th course (like "how to garnish this specific strawberry"). That's just "noise," or high-frequency detail.

The real, fundamental knowledge—the "Learngene"—is in the low-frequency components.

  • Analogy: Think of a song. The high frequencies are the specific notes, the vibrato, and the unique instruments. The low frequencies are the melody and the rhythm. You can play a melody on a piano, a guitar, or a synthesizer, and it's still the same song.
  • The Discovery: The "essence" of what the AI has learned (how to recognize a cat, how to understand grammar) is encoded in these smooth, low-frequency patterns of the math inside the model.

How FRONT Works: The "Frequency Filter"

The paper proposes a new framework called FRONT (FRequency-dOmain kNowledge Transfer). Here is how it works, step-by-step:

1. The Magic Filter (DCT)

Imagine the AI's brain is a giant, complex painting.

  • FRONT uses a mathematical tool called the Discrete Cosine Transform (DCT). Think of this as a special filter that separates the painting into two piles:
    • Pile A (Low Frequency): The broad strokes, the main shapes, the core composition. (This is the "Learngene").
    • Pile B (High Frequency): The tiny specks of dust, the specific brush textures, the noise. (This is the task-specific detail).
  • FRONT throws away Pile B and keeps Pile A. This is the "Learngene."
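The filtering step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the 8×8 weight matrix, the `keep` size, and the helper names (`dct_matrix`, `extract_learngene`) are all illustrative.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]          # frequency index (rows)
    i = np.arange(n)[None, :]          # spatial index (columns)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)               # rescale the DC row for orthonormality
    return C

def extract_learngene(W: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the top-left `keep` x `keep` low-frequency DCT coefficients
    (Pile A) and discard everything else (Pile B)."""
    Cr, Cc = dct_matrix(W.shape[0]), dct_matrix(W.shape[1])
    F = Cr @ W @ Cc.T                  # 2-D DCT of the weight matrix
    mask = np.zeros_like(F)
    mask[:keep, :keep] = 1.0           # low-frequency block survives
    return Cr.T @ (F * mask) @ Cc      # inverse DCT of the filtered spectrum

# Toy example: filter a random 8x8 "weight matrix".
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W_core = extract_learngene(W, keep=4)  # smooth, low-frequency skeleton of W
```

Because the DCT used here is orthonormal, keeping every coefficient (`keep=8`) reconstructs the original weights exactly, and filtering can only remove energy, never add it.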

2. The Shape-Shifter (Truncation & Padding)

Now you have a "Learngene" (a set of smooth, core patterns). You want to use it to start a new AI model that is a different size.

  • If the new model is smaller: You simply truncate the Learngene, keeping only its low-frequency block of coefficients. Since the core knowledge is concentrated there, you don't lose the important stuff.
  • If the new model is bigger: You pad the Learngene with zeros (blank space) in the high-frequency slots. The core patterns stay intact, and the new empty space is ready to be filled with new details later.
  • The Magic: This happens in milliseconds on a regular computer. No training required!
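The truncate-or-pad trick above amounts to one array copy in the frequency domain followed by an inverse DCT at the new size. A minimal sketch, assuming the Learngene is stored as a block of DCT coefficients `G` (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def resize_from_learngene(G: np.ndarray, out_shape: tuple) -> np.ndarray:
    """Initialize a weight matrix of `out_shape` from a Learngene coefficient
    block G: crop if the target is smaller, zero-pad if it is larger."""
    rows, cols = out_shape
    F = np.zeros(out_shape)
    r, c = min(rows, G.shape[0]), min(cols, G.shape[1])
    F[:r, :c] = G[:r, :c]              # crop or zero-pad in a single copy
    Cr, Cc = dct_matrix(rows), dct_matrix(cols)
    return Cr.T @ F @ Cc               # inverse 2-D DCT at the new size

# One Learngene, two different model sizes.
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 8))        # illustrative coefficient block
big = resize_from_learngene(G, (12, 12))
small = resize_from_learngene(G, (4, 4))
```

Note that zero-padding preserves the Learngene exactly: the orthonormal inverse DCT keeps the total energy of the padded spectrum unchanged, so nothing is distorted when growing the model.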

3. The "Refinement" (FRONT+)

Sometimes, the "Learngene" might still have a little bit of "noise" from the original task.

  • FRONT+ is like a quick polish. It takes the original model and runs a very short, cheap training session where it tells the model: "Hey, forget the specific details of this task. Focus only on the smooth, general patterns."
  • This creates an even cleaner "Learngene" that works even better.
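One natural way to phrase such a polish, sketched here as an assumption rather than the paper's actual objective, is to add a regularizer that measures how much weight energy lives outside the low-frequency block, and minimize it alongside the task loss during the short refinement pass. The helper names and the 8×8 shapes are hypothetical:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def high_freq_penalty(W: np.ndarray, keep: int) -> float:
    """Energy of W outside the low-frequency `keep` x `keep` DCT block.
    A candidate regularizer for a FRONT+-style refinement (hypothetical form):
    total_loss = task_loss + lam * high_freq_penalty(W, keep)."""
    Cr, Cc = dct_matrix(W.shape[0]), dct_matrix(W.shape[1])
    F = Cr @ W @ Cc.T                  # spectrum of the weights
    mask = np.zeros_like(F)
    mask[:keep, :keep] = 1.0           # the "smooth, general patterns"
    return float(np.sum((F * (1.0 - mask)) ** 2))

# A weight matrix built purely from low frequencies incurs ~zero penalty.
rng = np.random.default_rng(1)
F0 = np.zeros((8, 8))
F0[:3, :3] = rng.standard_normal((3, 3))   # only low-frequency content
C = dct_matrix(8)
W_smooth = C.T @ F0 @ C
```

Driving this penalty down pushes task-specific, high-frequency detail out of the weights, which is exactly the "forget the specifics, keep the smooth patterns" instruction described above.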

Why This is a Game-Changer

  1. Speed: It's like getting a head start in a race. Instead of running from the starting line (training from scratch), you are already 90% of the way there.
    • Real-world impact: In vision tasks, models trained with FRONT learned 15 times faster. In language tasks, they saved 40% of the computing power.
  2. Flexibility: You can take a model trained to recognize dogs and instantly use its "Learngene" to start a model that recognizes cats, or a model that is twice as big, or half as big.
  3. No "Magic" Needed: Unlike other methods that require training a giant "generator" robot, FRONT just uses math to filter the existing model. It's simple, fast, and free.

The Bottom Line

The authors found that the "soul" of an AI model is hidden in its low-frequency math. By extracting this soul, they created a universal "starter kit" (the Learngene) that can be instantly resized to fit any new AI project.

In short: They figured out how to distill the "wisdom" of a giant AI into a tiny, portable seed that can grow into any size of tree, instantly. This saves massive amounts of time, money, and energy in the world of Artificial Intelligence.