Imagine you are trying to paint a masterpiece of a bustling city.
The Old Way (Standard Diffusion Models):
Currently, the best AI artists work like very careful, slow painters. They start with a canvas covered in static noise. To create the image, they take 20 to 50 tiny steps. In every single step, they look at the entire canvas, from the tiniest brick on a building to the vast sky, and try to refine every single pixel at the same time. It's like trying to fix the cracks in a sidewalk while simultaneously painting the clouds. The results are beautiful, but the process takes a long time and uses a lot of energy.
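That loop can be sketched in a few lines. The `denoise_step` below is just a stand-in for a real neural network, and the step count and image size are arbitrary; the point is that every one of the many steps touches the whole full-size canvas.

```python
import numpy as np

def denoise_step(image, step):
    # Stand-in for the real model: a diffusion network would predict the
    # noise in `image` and subtract a small fraction of it. Here we just
    # shrink the noise a little to mimic gradual refinement.
    return image * 0.9

rng = np.random.default_rng(0)
image = rng.normal(size=(256, 256, 3))   # start from pure static noise
for step in range(50):                   # 20-50 steps in standard samplers
    image = denoise_step(image, step)    # the whole canvas, every single step
```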
The Problem:
Researchers have been trying to speed this up by teaching the AI to do it in fewer steps (like 4 steps instead of 20). But they hit a wall. If you force the AI to finish the whole painting in just 4 steps while looking at the whole canvas at once, the picture starts to look blurry or weird. The AI gets overwhelmed trying to do too much at once.
The New Solution: SwD (Scale-Wise Distillation)
This paper introduces a new method called SwD (Scale-wise Distillation). Think of SwD not as a painter who rushes, but as a painter who knows how to work from the big picture down to the fine details.
Here is how SwD works, using a simple analogy:
1. The "Zoom-Out" Strategy (Scale-Wise)
Imagine you are looking at a city through a telescope.
- Step 1 (The Big Picture): You start with a very blurry, low-resolution view. You can only see the general shapes: "There's a mountain there, a river there, and a city block there." You don't need to see the windows yet.
- Step 2 (Zooming In): Now, you zoom in a little. You refine the shapes. You can see the buildings, but not the windows.
- Step 3 (Getting Closer): You zoom in further. Now you see the windows and the doors.
- Step 4 (The Details): Finally, you zoom in all the way to see the people walking on the street.
Why is this faster?
In the old method, the AI worked at full resolution from the very first step, computing pixel-level detail for the windows even while the whole scene was still a blurry blob. That's wasted effort!
With SwD, the AI only calculates the "mountain shapes" when it's at the low resolution. It only calculates the "window details" when it's at the high resolution. It avoids doing unnecessary math on details that don't exist yet.
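A little arithmetic makes the savings concrete. The resolutions below are illustrative, not the paper's actual schedule; the point is that four scale-wise steps touch far fewer pixels in total than four full-resolution steps.

```python
full_res = 1024   # final image resolution (illustrative)
steps = 4

# Old way: every denoising step runs on the full-resolution canvas.
full_cost = steps * full_res**2            # total pixels processed

# Scale-wise: each step runs at a growing resolution (made-up schedule).
scales = [256, 512, 768, 1024]
swd_cost = sum(r**2 for r in scales)

print(full_cost / swd_cost)  # > 2: the old way processes over twice the pixels
```

For transformer-based models the real savings can be even larger, since attention cost grows faster than linearly with the number of pixels.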
2. The "Magic Checklist" (MMD Loss)
The paper also introduces a new way to teach the AI, built on a classic statistical tool called MMD (Maximum Mean Discrepancy).
Imagine you are teaching a student to draw a cat.
- Old Way: You show the student a photo of a cat and say, "Draw exactly what you see, pixel by pixel." If the student makes a tiny mistake, they get corrected.
- The SwD Way (MMD): You give the student a "Magic Checklist" (a pre-trained AI model). You tell the student, "Don't just copy the photo. Look at the vibe of the cat. Does it have the same 'cat-ness' as the photo? Are the ears in the right spot relative to the tail? Does the fur feel right?"
The MMD loss is like a sophisticated quality control inspector that checks if the overall feeling and structure of the drawing match the original, rather than just checking if every single pixel is identical. This helps the AI learn faster and produce better results, even with fewer steps.
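A minimal sketch of that "vibe check": the textbook (biased) MMD estimator with a Gaussian kernel, applied to feature vectors. In the paper the comparison happens between features of the student's and teacher's images; the random features, kernel bandwidth, and sample sizes below are placeholder assumptions for illustration.

```python
import numpy as np

def mmd(x, y, sigma=4.0):
    """Squared Maximum Mean Discrepancy between two sets of feature
    vectors, using an RBF kernel. A small value means the two sets look
    like draws from the same distribution, i.e. they share the same "vibe"."""
    def kernel(a, b):
        # Pairwise squared distances, then a Gaussian kernel.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

rng = np.random.default_rng(0)
teacher = rng.normal(0.0, 1.0, size=(200, 8))  # the teacher's features
student = rng.normal(0.0, 1.0, size=(200, 8))  # same distribution: right "vibe"
off = rng.normal(2.0, 1.0, size=(200, 8))      # shifted distribution: wrong "vibe"

print(mmd(student, teacher))  # near zero: the distributions match
print(mmd(off, teacher))      # much larger: the distributions differ
```

Note that no pair of samples is compared pixel-by-pixel; only the two distributions as a whole are matched, which is exactly the "overall feeling" idea above.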
The Result
By combining these two ideas:
- Working from low-res to high-res (saving time by not calculating details too early).
- Using a "vibe-check" checklist (learning the essence of the image faster).
The authors created models that generate images and videos 10 times faster than the original slow models, and 2 to 3 times faster than other fast models, without losing quality.
In a nutshell:
Instead of trying to build a skyscraper by laying every single brick at full size immediately, SwD builds the foundation first, then the frame, then the walls, and finally the windows. It's a smarter, more efficient way to build, resulting in a beautiful skyscraper in record time.