Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

This paper introduces the score-regularized continuous-time consistency model (rCM), which overcomes large-scale infrastructure and quality limitations of existing methods via a parallelism-compatible JVP kernel and a novel score-regularization objective, enabling high-fidelity, diverse video and image generation in just 1–4 steps on models up to 14 billion parameters.

Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang

Published 2026-02-17

The Big Picture: Speeding Up the Artist

Imagine a world-class artist (the Teacher Model) who can paint a masterpiece, but it takes them 50 hours to finish one. They are slow, but the result is perfect.

We want a student artist (the Student Model) who can paint just as well, but in 1 or 2 hours. This process of teaching the student to mimic the teacher is called Distillation.

For a long time, scientists had two main ways to teach this student:

  1. The "Copycat" Method (Consistency Models): Tell the student, "Don't worry about the middle steps; just look at the final painting and guess the start." This is fast, but the student often gets the details wrong (blurry text, weird shapes).
  2. The "Critic" Method (Score Distillation/GANs): Tell the student, "Paint something, and I'll critique it against the teacher's work." This makes the details sharp, but the student becomes too cautious to be creative and ends up painting the same thing every time (boring, repetitive results).
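In toy form, the two recipes above look roughly like this. Everything here (the function names, the loss shapes, the DMD-style nudge) is a simplified illustration on plain floats, not the paper's actual code:

```python
# Toy sketches of the two classic distillation objectives (illustrative only).

def consistency_loss(student, x_t, t, x_s, s):
    # "Copycat": predictions from two points on the same noising path
    # should agree on the final clean sample.
    return (student(x_t, t) - student(x_s, s)) ** 2

def score_distillation_nudge(teacher_score, fake_score, x0):
    # "Critic": compare the teacher's score (gradient of log-density) of a
    # generated sample against a fake-score estimate; the student is pushed
    # toward where the teacher says real data is more likely.
    return fake_score(x0) - teacher_score(x0)
```

With a critic-style nudge alone, every sample gets pushed toward the same high-density regions, which is exactly the "everyone paints the same thing" failure described above.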

This paper introduces a new method called rCM. It combines the best of both worlds: the speed of the Copycat and the sharpness of the Critic, without the downsides.


The Problem: The "Blurry Detail" Trap

The researchers first tried to scale up the "Copycat" method (called sCM) to massive models (14 billion parameters!) that can generate videos and complex images.

The Issue:
Imagine asking a student to draw a tiny, intricate watch face with the time "11:44 AM" on it.

  • The sCM student could draw the watch quickly, but the numbers were blurry, and the hands were crooked.
  • Why? The student was trying to learn a complex mathematical path in one giant leap. As the "leap" got bigger (to handle huge models), tiny math errors piled up, like a snowball rolling down a hill, eventually ruining the fine details.
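This snowballing is the same intuition you see when solving a differential equation with a step that is too large. A toy numerical example (unrelated to the paper's actual model, purely to show how one giant leap loses accuracy):

```python
import math

def euler_solve(f, y0, t0, t1, n_steps):
    # Explicit Euler: take n_steps equal jumps from t0 to t1.
    y, t = y0, t0
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        y += h * f(t, y)
        t += h
    return y

f = lambda t, y: -y                 # dy/dt = -y, exact answer y(1) = e**-1
exact = math.exp(-1)                # ≈ 0.3679
one_jump = euler_solve(f, 1.0, 0.0, 1.0, 1)    # one giant leap lands at 0.0
fifty = euler_solve(f, 1.0, 0.0, 1.0, 50)      # 50 small steps land close
```

The single jump misses the exact answer badly, while fifty small steps land within about 1%; the bigger the leap, the more the local errors compound.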

The Solution: The "Safety Net" (rCM)

To fix this, the authors added a Score-Regularizer. Think of this as a Safety Net or a Spotter in gymnastics.

  1. The Main Act (sCM): The student tries to jump from the start to the finish in one go (fast, but risky).
  2. The Safety Net (Score Distillation): While the student is jumping, a "Critic" gently checks their form. If the student starts to drift off course (losing detail), the Critic gives a tiny nudge to correct them.

The Magic:

  • The Jump ensures the student stays diverse and creative (doesn't get stuck painting the same thing).
  • The Safety Net ensures the details (like text or object shapes) remain sharp and accurate.
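The combination can be sketched as one weighted objective. The helper names and the weight `lam` below are hypothetical stand-ins for illustration, not the paper's released code or hyperparameters:

```python
# Minimal sketch of the rCM recipe: consistency ("the Jump") plus a
# weighted score-regularization term ("the Safety Net").

def rcm_loss(jump_loss, safety_net_loss, x, lam=0.5):
    # The Jump keeps samples diverse; the Safety Net keeps details sharp.
    # Setting lam = 0 would recover the plain "Copycat" objective.
    return jump_loss(x) + lam * safety_net_loss(x)
```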

The result? A student who can paint a masterpiece in 1 to 4 steps (instead of 50) that looks just as good as the teacher's work, while keeping the teacher's creative variety.


The Technical Hurdle: The "Heavy Lifting"

Scaling this up to massive models (like 14 billion parameters) is like trying to run a marathon while carrying a piano.

The Challenge:
The "Copycat" method requires a specific, heavy math operation called a Jacobian-Vector Product (JVP). In standard deep-learning frameworks, this operation is slow, doesn't work inside the fast attention kernels modern models rely on, and breaks when you try to split the work across many GPUs (parallelism). It's like trying to pass a heavy piano through a narrow hallway; it gets stuck.

The Fix:
The team built a custom FlashAttention-2 JVP Kernel.

  • Analogy: Imagine instead of carrying the piano through the hallway, they built a specialized conveyor belt that fits the piano perfectly and moves it instantly.
  • This allowed them to train these massive models on thousands of GPUs without the math breaking or the computer running out of memory.
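To make "JVP" concrete: it answers the question "if I nudge the inputs in direction v, how does the output move?" in a single pass. Here it is in miniature on a tiny hand-written function, with a finite-difference check; this is only the underlying math, not the paper's FlashAttention-2 kernel:

```python
def f(x, y):
    # A toy function from 2 inputs to 2 outputs.
    return (x * x, x * y)

def jvp_analytic(x, y, vx, vy):
    # The Jacobian of f is [[2x, 0], [y, x]]; a JVP multiplies it by the
    # direction (vx, vy) without ever forming the full matrix.
    return (2 * x * vx, y * vx + x * vy)

def jvp_numeric(x, y, vx, vy, eps=1e-6):
    # Sanity check: (f(p + eps*v) - f(p)) / eps approximates J @ v.
    a0, b0 = f(x, y)
    a1, b1 = f(x + eps * vx, y + eps * vy)
    return ((a1 - a0) / eps, (b1 - b0) / eps)
```

The custom kernel computes this same kind of directional derivative through the attention layers of a 14-billion-parameter model, which is why it had to be purpose-built.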

The Results: What Can It Do?

The researchers tested this on two massive models:

  1. Cosmos-Predict2: A giant image generator.
  2. Wan2.1: A giant video generator (creating 5-second videos).

The Wins:

  • Speed: It generates images in 1 step and videos in 2 steps. That's a 15x to 50x speedup compared to the original slow models.
  • Quality: It can render tiny text (like "Casio G-Shock" on a watch) perfectly, which previous fast methods failed at.
  • Diversity: Unlike other fast methods that make everything look the same, rCM keeps the variety. If you ask for "a cat," you get different cats, not the same cat 10 times.

Summary

Think of rCM as the ultimate Art Teacher.

  • It teaches the student to be fast (like a sprinter).
  • It teaches the student to be accurate (like a surgeon).
  • It teaches the student to be creative (like an artist).

By combining a "forward-looking" strategy with a "corrective" safety net, and building a custom engine to handle the heavy math, they have created a framework that makes high-quality, instant AI video and image generation a reality.
