Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

This paper introduces the score-regularized continuous-time consistency model (rCM), which overcomes large-scale infrastructure and quality limitations of existing methods via a parallelism-compatible JVP kernel and a novel score-regularization objective, enabling high-fidelity, diverse video and image generation in just 1–4 steps on models up to 14 billion parameters.

Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang

Published 2026-02-17

The Big Picture: Speeding Up the Artist

Imagine a world-class artist (the Teacher Model) who can paint a masterpiece, but it takes them 50 hours to finish one. They are slow, but the result is perfect.

We want a student artist (the Student Model) who can paint just as well, but in 1 or 2 hours. This process of teaching the student to mimic the teacher is called Distillation.

For a long time, scientists had two main ways to teach this student:

  1. The "Copycat" Method (Consistency Models): Tell the student, "Don't worry about the middle steps; just look at the final painting and guess the start." This is fast, but the student often gets the details wrong (blurry text, weird shapes).
  2. The "Critic" Method (Score Distillation/GANs): Tell the student, "Paint something, and I'll critique it against the teacher's work." This makes the details sharp, but the student becomes too cautious to be creative and ends up painting the same thing every time (boring, repetitive results).
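In toy form, the two recipes above look roughly like this. Everything here (the function names, the loss shapes, the DMD-style nudge) is a simplified illustration on plain floats, not the paper's actual code:

```python
# Toy sketches of the two classic distillation objectives (illustrative only).

def consistency_loss(student, x_t, t, x_s, s):
    # "Copycat": predictions from two points on the same noising path
    # should agree on the final clean sample.
    return (student(x_t, t) - student(x_s, s)) ** 2

def score_distillation_nudge(teacher_score, fake_score, x0):
    # "Critic": compare the teacher's score (gradient of log-density) of a
    # generated sample against a fake-score estimate; the student is pushed
    # toward where the teacher says real data is more likely.
    return fake_score(x0) - teacher_score(x0)
```

With a critic-style nudge alone, every sample gets pushed toward the same high-density regions, which is exactly the "everyone paints the same thing" failure described above.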

This paper introduces a new method called rCM. It combines the best of both worlds: the speed of the Copycat and the sharpness of the Critic, without the downsides.


The Problem: The "Blurry Detail" Trap

The researchers first tried to scale up the "Copycat" method (called sCM) to massive models (14 billion parameters!) that can generate videos and complex images.

The Issue:
Imagine asking a student to draw a tiny, intricate watch face with the time "11:44 AM" on it.

  • The sCM student could draw the watch quickly, but the numbers were blurry, and the hands were crooked.
  • Why? The student was trying to learn a complex mathematical path in one giant leap. As the "leap" got bigger (to handle huge models), tiny math errors piled up, like a snowball rolling down a hill, eventually ruining the fine details.
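This snowballing is the same intuition you see when solving a differential equation with a step that is too large. A toy numerical example (unrelated to the paper's actual model, purely to show how one giant leap loses accuracy):

```python
import math

def euler_solve(f, y0, t0, t1, n_steps):
    # Explicit Euler: take n_steps equal jumps from t0 to t1.
    y, t = y0, t0
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        y += h * f(t, y)
        t += h
    return y

f = lambda t, y: -y                 # dy/dt = -y, exact answer y(1) = e**-1
exact = math.exp(-1)                # ≈ 0.3679
one_jump = euler_solve(f, 1.0, 0.0, 1.0, 1)    # one giant leap lands at 0.0
fifty = euler_solve(f, 1.0, 0.0, 1.0, 50)      # 50 small steps land close
```

The single jump misses the exact answer badly, while fifty small steps land within about 1%; the bigger the leap, the more the local errors compound.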

The Solution: The "Safety Net" (rCM)

To fix this, the authors added a Score-Regularizer. Think of this as a Safety Net or a Spotter in gymnastics.

  1. The Main Act (sCM): The student tries to jump from the start to the finish in one go (fast, but risky).
  2. The Safety Net (Score Distillation): While the student is jumping, a "Critic" gently checks their form. If the student starts to drift off course (losing detail), the Critic gives a tiny nudge to correct them.

The Magic:

  • The Jump ensures the student stays diverse and creative (doesn't get stuck painting the same thing).
  • The Safety Net ensures the details (like text or object shapes) remain sharp and accurate.
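The combination can be sketched as one weighted objective. The helper names and the weight `lam` below are hypothetical stand-ins for illustration, not the paper's released code or hyperparameters:

```python
# Minimal sketch of the rCM recipe: consistency ("the Jump") plus a
# weighted score-regularization term ("the Safety Net").

def rcm_loss(jump_loss, safety_net_loss, x, lam=0.5):
    # The Jump keeps samples diverse; the Safety Net keeps details sharp.
    # Setting lam = 0 would recover the plain "Copycat" objective.
    return jump_loss(x) + lam * safety_net_loss(x)
```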

The result? A student who can paint a masterpiece in 1 to 4 steps (instead of 50) that looks just as good as the teacher's work, while keeping the teacher's creative variety.


The Technical Hurdle: The "Heavy Lifting"

Scaling this up to massive models (like 14 billion parameters) is like trying to run a marathon while carrying a piano.

The Challenge:
The "Copycat" method requires a specific, heavy math operation called a Jacobian-Vector Product (JVP). In standard deep-learning frameworks, this operation is slow, doesn't work inside the fast attention kernels modern models rely on, and breaks when you try to split the work across many GPUs (parallelism). It's like trying to pass a heavy piano through a narrow hallway; it gets stuck.

The Fix:
The team built a custom FlashAttention-2 JVP Kernel.

  • Analogy: Imagine instead of carrying the piano through the hallway, they built a specialized conveyor belt that fits the piano perfectly and moves it instantly.
  • This allowed them to train these massive models on thousands of GPUs without the math breaking or the computer running out of memory.
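To make "JVP" concrete: it answers the question "if I nudge the inputs in direction v, how does the output move?" in a single pass. Here it is in miniature on a tiny hand-written function, with a finite-difference check; this is only the underlying math, not the paper's FlashAttention-2 kernel:

```python
def f(x, y):
    # A toy function from 2 inputs to 2 outputs.
    return (x * x, x * y)

def jvp_analytic(x, y, vx, vy):
    # The Jacobian of f is [[2x, 0], [y, x]]; a JVP multiplies it by the
    # direction (vx, vy) without ever forming the full matrix.
    return (2 * x * vx, y * vx + x * vy)

def jvp_numeric(x, y, vx, vy, eps=1e-6):
    # Sanity check: (f(p + eps*v) - f(p)) / eps approximates J @ v.
    a0, b0 = f(x, y)
    a1, b1 = f(x + eps * vx, y + eps * vy)
    return ((a1 - a0) / eps, (b1 - b0) / eps)
```

The custom kernel computes this same kind of directional derivative through the attention layers of a 14-billion-parameter model, which is why it had to be purpose-built.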

The Results: What Can It Do?

The researchers tested this on two massive models:

  1. Cosmos-Predict2: A giant image generator.
  2. Wan2.1: A giant video generator (creating 5-second videos).

The Wins:

  • Speed: It generates images in 1 step and videos in 2 steps. That's a 15x to 50x speedup compared to the original slow models.
  • Quality: It can render tiny text (like "Casio G-Shock" on a watch) perfectly, which previous fast methods failed at.
  • Diversity: Unlike other fast methods that make everything look the same, rCM keeps the variety. If you ask for "a cat," you get different cats, not the same cat 10 times.

Summary

Think of rCM as the ultimate Art Teacher.

  • It teaches the student to be fast (like a sprinter).
  • It teaches the student to be accurate (like a surgeon).
  • It teaches the student to be creative (like an artist).

By combining a "forward-looking" strategy with a "corrective" safety net, and building a custom engine to handle the heavy math, they have created a framework that makes high-quality, instant AI video and image generation a reality.
