Imagine you are a master chef who has spent years perfecting a giant, 100-layer lasagna recipe. This recipe is so complex and delicious that everyone wants it. However, you face a problem: some people only have small ovens and can only cook a 4-layer version, while others have massive industrial kitchens and can handle a 20-layer version.
In the world of AI, this "recipe" is a Diffusion Model (a type of AI that creates images), and the "layers" are the stacked neural-network blocks that make the AI smart.
The Problem: One Size Doesn't Fit All
Usually, if you want a 4-layer AI, you have to train it from scratch. It's like teaching a new chef to make a 4-layer lasagna from zero, even though you already have the perfect 100-layer recipe. This takes forever and uses a lot of electricity (computing power).
If you try to just chop the 100-layer recipe down to 4 layers, it often tastes terrible because the "flavor" (the knowledge) gets lost or mixed up.
The Solution: FINE (Factorizing Knowledge)
The paper introduces FINE, a clever new way to train AI. Instead of writing one giant, rigid recipe, FINE teaches the AI to break its knowledge down into two distinct parts:
- The "Learngenes" (The Universal Flavor): Think of these as the core ingredients and fundamental cooking techniques that never change, no matter how big or small the lasagna is. Whether it's a 4-layer or a 20-layer dish, you still need the same perfect tomato sauce, the same way to layer the cheese, and the same oven temperature. These are the size-agnostic parts.
- The "Sigma" (The Portion Control): This is the part that changes based on the size of the oven. It's just a simple instruction on how much of the universal flavor to use for a specific layer.
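To make the split concrete, here is a rough NumPy sketch of the idea: shared factors (the "Learngenes") are reused by every model, and each layer only carries a tiny coefficient vector (the "Sigma"). The names, dimensions, and exact factorization below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
d, r = 8, 4          # layer weight is d x d; r is the rank of the shared factors

# "Learngenes": shared, size-agnostic factors (identical for every model size).
U = rng.standard_normal((d, r))
V = rng.standard_normal((r, d))

def layer_weight(sigma):
    """Compose one layer's weight from the shared factors and a tiny
    layer-specific coefficient vector sigma (the 'portion control')."""
    return U @ np.diag(sigma) @ V

# A 4-layer model and a 6-layer model reuse the SAME U and V;
# only their per-layer sigma vectors differ.
small_model = [layer_weight(rng.standard_normal(r)) for _ in range(4)]
big_model   = [layer_weight(rng.standard_normal(r)) for _ in range(6)]
```

Notice that each layer's "personality" costs only r numbers (the sigma vector), while the expensive d-by-r factors are written once and shared everywhere.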
How It Works (The Analogy)
Step 1: The Master Class (Pre-training)
Instead of training a specific 10-layer AI, the researchers train a "Universal Chef." This chef learns the Learngenes (the core techniques) and how to adjust them for different layers. This is a one-time, expensive effort, but it's worth it.
Step 2: Instant Deployment (Initialization)
Now, imagine a customer walks in and says, "I need a 6-layer lasagna."
- Old Way: You hire a new chef and make them train for months.
- FINE Way: You take your Universal Chef's Learngenes (the frozen, perfect techniques) and just quickly write a tiny note (the Sigma) telling the chef exactly how to apply those techniques to 6 layers. You don't need to retrain the whole chef; you just tweak the note.
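The cost difference between the two ways can be sketched with a simple parameter count. The dimensions here are hypothetical, chosen only to show why tuning the "note" is so much cheaper than retraining the "chef": with frozen learngenes, the trainable part shrinks from every weight of every layer down to one small sigma vector per layer.

```python
# Hypothetical sizes, for illustration only.
d, r, layers = 8, 4, 6       # layer width d, factor rank r, requested depth

full_params  = layers * d * d   # old way: train every weight of every layer
sigma_params = layers * r       # FINE way: learngenes frozen, tune only sigma

print(f"full: {full_params}, sigma only: {sigma_params}")
# prints "full: 384, sigma only: 24"
```

Even in this toy setting the tunable part is over an order of magnitude smaller, and the gap widens as the layers get larger.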
Why Is This a Big Deal?
- Speed: It's like having a "copy-paste" button for intelligence. You can create a tiny AI for a phone or a huge AI for a supercomputer in minutes, not months.
- Efficiency: The paper shows that FINE can get a model ready 3 times faster than traditional methods.
- Flexibility: Because the "Learngenes" are universal, they work even if you change the task. The paper shows that the same "Universal Chef" trained on making images of cats can be quickly adapted to make images of dogs, or even used for medical scans, just by adjusting the "Sigma" note.
The "DNA" Metaphor
Think of the Learngenes as the DNA of a species. A human, a chimp, and a gorilla all share a lot of the same DNA (the universal knowledge). The differences between them are just small genetic tweaks (the Sigma). FINE realizes that instead of growing a whole new organism from scratch, you just need to take the shared DNA and apply the specific tweaks for the size you need.
In Summary
FINE is a method that stops AI developers from reinventing the wheel every time they need a different-sized model. It separates the "eternal wisdom" of the AI from the "specific settings" of the model size. This allows us to instantly spawn high-quality AI models of any size, saving time, money, and energy, while still producing top-tier results.