Imagine you are trying to teach a robot how to draw a perfect picture of a cat.
The Old Way: The Slow Hiker
For a long time, the best way to do this was Diffusion Models. Think of this like a hiker trying to get from the top of a mountain (a random pile of noise) to a beautiful valley (a clear picture of a cat).
The hiker takes tiny, careful steps. They look at the map, take one step, look again, take another. To get a really good picture, they might need to take 50 or 100 steps. It's accurate, but it's slow. If you want to generate a video or a high-resolution image, this takes forever and costs a lot of computer power.
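The "slow hiker" loop can be sketched in a few lines. Here `velocity` is a stand-in for the trained network; in this toy it simply points from the current sample toward a fixed target, which is an illustrative assumption, not how a real diffusion model works.

```python
import numpy as np

def velocity(x, target):
    # Hypothetical "which way is the valley?" signal.
    # A real model predicts this from x alone, with no access to the target.
    return target - x

def sample_many_steps(noise, target, num_steps=50):
    """Walk from noise toward the target in many small steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for _ in range(num_steps):  # look at the map, take one step, repeat
        x = x + dt * velocity(x, target)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)          # the mountain top: pure noise
target = np.array([1.0, 2.0, 3.0, 4.0])  # the valley: the "cat picture"
out = sample_many_steps(noise, target)
```

The point of the sketch is the loop itself: 50 network evaluations per image, which is exactly the cost that one-step methods try to eliminate.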
The New Idea: The Teleporter
Researchers have been trying to build a "teleporter" that can jump straight from the mountain to the valley in one giant leap. This is called "One-Step Generation."
However, building a teleporter is hard. If you just tell the robot to "jump," it usually lands in a muddy swamp instead of the valley. Previous attempts at teleporters either required the robot to carry a heavy backpack (multiple data points) or they were unstable and crashed.
Enter TVM: The "Terminal Velocity" Trick
The paper introduces a new method called Terminal Velocity Matching (TVM).
Here is the analogy:
Imagine you are teaching a skateboarder to ride down a ramp.
- Old Methods (Flow Matching): You coach the skateboarder on their speed and direction at every instant of the ride down. If the model's predictions are slightly off at any moment, the errors pile up and the ride ends somewhere wrong.
- TVM (Terminal Velocity Matching): Instead of worrying about the start, you tell the skateboarder, "I don't care how you start. Just make sure that right before you hit the bottom, you are moving at the exact right speed and angle to land perfectly."
Why is this better?
In physics, if you know exactly how fast and in what direction something is moving at the end of a trip, you can work backward to figure out the perfect path to get there. TVM forces the AI to learn the "perfect landing" (the terminal velocity) rather than the "perfect start." This gives the AI a much stronger, more stable goal to aim for.
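The contrast between the two objectives can be written schematically. The linear noise-to-data path, the function names, and the toy losses below are illustrative assumptions to show where the supervision is applied; they are not the paper's exact formulation.

```python
import numpy as np

def path(noise, data, t):
    # Straight-line path from noise (t=0) to data (t=1).
    return (1.0 - t) * noise + t * data

def true_velocity(noise, data):
    # Along a straight path, the velocity is constant.
    return data - noise

def flow_matching_loss(model, noise, data, t):
    # Supervise the model's velocity at an intermediate time t.
    pred = model(path(noise, data, t), t)
    return float(np.mean((pred - true_velocity(noise, data)) ** 2))

def terminal_matching_loss(model, noise, data):
    # Supervise the model's velocity only at the endpoint, t = 1:
    # "I don't care how you start, just land perfectly."
    pred = model(path(noise, data, 1.0), 1.0)
    return float(np.mean((pred - true_velocity(noise, data)) ** 2))
```

Both losses compare a predicted velocity against the true one; the difference is *where along the trip* the model is held accountable.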
The "Bumpy Road" Problem
There was a catch. The AI architecture they used (called a "Transformer") is like a car with very sensitive suspension. If you push it too hard, it shakes apart. The math behind TVM requires the car to be smooth and predictable (mathematically "Lipschitz continuous"), but standard AI cars aren't built that way.
The Fix: The authors made a few tiny, clever adjustments to the car's suspension (the AI's internal structure). They added special "shock absorbers" (normalization layers) that keep the car steady even when the math gets intense. This allowed them to train the teleporter without it exploding.
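A minimal sketch of the "shock absorber" idea: a normalization layer rescales activations to a fixed magnitude, so a huge input produces the same bounded output as a small one. This is plain RMS normalization for illustration; the paper's exact architectural changes may differ.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Rescale x so its root-mean-square magnitude is (about) 1."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

small = np.array([0.1, -0.2, 0.3])
huge = small * 1000.0  # same direction, a 1000x bigger "bump in the road"

# After normalization, both bumps land in (almost) the same place,
# so downstream layers never see a runaway signal.
```

Because the output magnitude stays bounded no matter how large the input grows, the network behaves much closer to the smooth, predictable ("Lipschitz") function the TVM math assumes.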
The "Super-Engine"
To make this work fast enough to be useful, they also built a custom engine part (a specialized piece of GPU code, a custom "Flash Attention" kernel). This engine lets the AI calculate the "landing speed" incredibly fast, using less memory and time than previous methods.
The Results: Magic in a Blink
The results are impressive:
- Speed: It can generate high-quality images in one step (one "teleport").
- Quality: The pictures are just as good as the old slow methods that took 50 steps.
- Efficiency: It works on high-resolution images (like 512x512 pixels) without needing a supercomputer the size of a house.
Summary
Think of TVM as teaching an AI to drive by focusing on the perfect parking job at the end of the driveway, rather than the first turn of the steering wheel. By fixing the car's suspension and giving it a turbo-charged engine, they created a system that can generate beautiful images instantly, solving the age-old problem of "quality vs. speed" in AI art.