Momentum Guidance: Plug-and-Play Guidance for Flow Models

Imagine you are trying to teach a robot to paint a masterpiece based on a simple description, like "a cat sitting on a fence."

The Problem: The "Blurry Dream" Robot

Right now, the best AI painters (called Flow Models) are incredibly talented, but they have a weird habit. When you ask them to paint, they often produce images that look like a dream you had after eating too much cheese.

The colors are there, the shapes are roughly right, but everything is soft, blurry, and lacks detail. The cat's fur looks like a fuzzy cloud, and the fence posts are melting into the background.

Why does this happen? Because the AI was trained to be "safe." It learned to predict the average of all possible cats and fences. In math terms, it smoothed out all the sharp edges and high-frequency details to avoid making mistakes. The result? A safe, but boring, blurry image.

The Old Fix: The "Double-Check" Method

To fix the blur, researchers developed a technique called Classifier-Free Guidance (CFG). Think of this like asking the robot to paint the picture twice:

First pass: "Paint a cat." (The blurry version).
Second pass: "Paint something, with no specific instructions." (A super-blurry, generic version).

Then, the computer takes the first version and pushes it away from the second version. It's like saying, "Okay, the generic cat is too fuzzy, so let's make the specific cat even sharper by comparing it to the fuzzy one."

The Catch: This works great, but it's twice as slow. The robot has to do double the work for every single step of the painting process. If you want a high-quality image, you have to wait twice as long.

The New Solution: "Momentum Guidance" (The Skateboarder)

This paper introduces a new trick called Momentum Guidance (MG). It's like giving the robot a skateboard instead of making it walk.

Here is the analogy:
Imagine the robot is a skateboarder trying to ride down a hill to reach the "perfect image" at the bottom.

The Old Way (CFG): The skateboarder stops at every single step to ask a friend, "Is this the right direction?" and then asks another friend, "What would a generic path look like?" Then they compare notes. It's accurate, but it takes forever.
The New Way (MG): The skateboarder looks at their recent history.
- "I was moving slowly and smoothly a moment ago (the blurry past)."
- "I am moving faster and more sharply right now."
- "Let's use that difference to push me even harder in the right direction!"

How it works simply:
The AI remembers the "velocity" (the direction and speed) of its previous steps. It calculates an average of where it has been (which is usually smooth and blurry) and then pushes the current step away from that average.

It's like a skier who remembers the smooth, wide turns they made at the top of the mountain. As they get closer to the bottom (the final image), they remember those wide turns and deliberately steer sharper to carve out the details.

Why is this a Big Deal?

It's Free (in terms of time): The robot doesn't need to do extra work. It just uses the information it's already calculating as it paints. It's like getting a bonus feature without paying extra.
It's Sharper: The images come out with crisp details: individual hairs on the cat, clear reflections on a car, sharp edges on buildings.
It Works with the Old Way: You can use this new "skateboard" trick on top of the old "double-check" method to get even better results, or use it alone to save time.

The Result

The researchers tested this on famous AI models (like Stable Diffusion 3 and FLUX).

Before: A blurry, dream-like image.
After: A crisp, high-definition photo where you can see the texture of the wood on the fence and the whiskers on the cat.

In a nutshell: Momentum Guidance is a clever way to tell the AI, "Don't just follow the smooth, safe path. Remember where you've been, and use that memory to push yourself toward the sharp, exciting details." It makes AI art sharper and more detailed without extra model evaluations, and when used on its own it is faster than the "double-check" method.

1. Problem Statement

Flow-based generative models (including Diffusion and Rectified Flow models) have achieved state-of-the-art results in image synthesis. However, a critical practical issue remains: pretrained models often produce "diffuse" samples when used in their vanilla conditional form.

The Cause: Neural networks inherently learn smoothed approximations of data distributions. Combined with Exponential Moving Averages (EMA) of model parameters (used to reduce noise during training), the learned velocity fields tend to suppress high-frequency details, resulting in blurry textures, muted contrast, and overly spread-out distributions.
The Limitation of Existing Solutions:
- Classifier-Free Guidance (CFG): The standard solution involves extrapolating the conditional prediction away from an unconditional prediction. While effective at sharpening images, CFG doubles the inference cost (requiring two forward passes per step) and often reduces sample diversity (lower recall) as the guidance scale increases.
- Autoguidance: Uses a weaker version of the same model as the reference. The weaker model can be cheaper to run than a second full-size pass, but it requires auxiliary checkpoints (often unavailable for large open models) and increases memory usage.
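
For reference, the CFG extrapolation described above can be written in one common convention. The three-argument form of $v_\theta$, the condition $c$, the null condition $\varnothing$, and the scale $w$ are notation assumed here for illustration, not taken from the paper:

$$\tilde{v}_{t_i} = v_\theta(Z_{t_i}, t_i, \varnothing) + w \left[ v_\theta(Z_{t_i}, t_i, c) - v_\theta(Z_{t_i}, t_i, \varnothing) \right]$$

Under this convention $w = 1$ recovers the plain conditional velocity and $w > 1$ extrapolates away from the unconditional one. Each step needs two network evaluations, so a 64-step sampler spends $2 \times 64 = 128$ evaluations instead of 64.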

2. Methodology: Momentum Guidance (MG)

The authors propose Momentum Guidance (MG), a plug-and-play inference-time technique that leverages the ODE trajectory itself to generate a guidance signal without additional model evaluations.

Core Concept

The method is based on the observation that in flow-based sampling, the marginal distributions become sharper as time $t$ progresses. Conversely, velocities at earlier time steps correspond to smoother marginals. Instead of computing a separate "unconditional" or "weaker" velocity field, MG reuses the history of the current trajectory to form a smoother reference.

Algorithm

MG maintains an Exponential Moving Average (EMA) of past velocities ( $m_t$ ) and extrapolates the current velocity ( $v_t$ ) away from this momentum.

Initialization: Sample $Z_{t_0} \sim \mathcal{N}(0, I)$ and initialize momentum $m_{t_0} = v_\theta(Z_{t_0}, t_0)$ .
Update Loop: At each timestep $t_i$ :
- Compute current velocity: $v_{t_i} = v_\theta(Z_{t_i}, t_i)$ .
- Update Momentum (EMA): $m_{t_{i+1}} = (1 - \beta)v_{t_i} + \beta m_{t_i}$ , where $\beta \in [0, 1)$ controls the decay of historical velocities.
- Extrapolation Step: Update the latent state using an extrapolated velocity:
  $Z_{t_{i+1}} = Z_{t_i} + \Delta t \left[ v_{t_i} + \alpha (v_{t_i} - m_{t_i}) \right]$
  Here, $\alpha > 0$ is the guidance strength. The term $(v_{t_i} - m_{t_i})$ is the difference between the current, sharper velocity and the smoothed historical average $m_{t_i}$ (the momentum before this step's update), so it acts as a "sharpening" direction.
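
The update rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: `mg_euler_sample`, `toy_velocity`, and the evaluation counter are hypothetical names, and the toy velocity field only exercises the code path (it says nothing about image quality).

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def mg_euler_sample(
    velocity: Callable[[Vector, float], Vector],
    z0: Vector,
    timesteps: Sequence[float],
    alpha: float,
    beta: float,
) -> Vector:
    """Euler sampling with Momentum Guidance, following the update rules above."""
    z = list(z0)
    m: Optional[Vector] = None  # momentum; set to the first velocity (m_{t_0} = v_{t_0})
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        v = velocity(z, t)  # the only model evaluation in this step
        if m is None:
            m = list(v)  # v - m = 0 here, so the first step is plain Euler
        dt = t_next - t
        # Extrapolate v_{t_i} away from m_{t_i} (the momentum before this step's update).
        z = [zi + dt * (vi + alpha * (vi - mi)) for zi, vi, mi in zip(z, v, m)]
        # EMA update: m_{t_{i+1}} = (1 - beta) * v_{t_i} + beta * m_{t_i}
        m = [(1.0 - beta) * vi + beta * mi for vi, mi in zip(v, m)]
    return z


# Toy stand-in for v_theta (an assumption): pulls every coordinate toward 1.0.
nfe = {"count": 0}


def toy_velocity(z: Vector, t: float) -> Vector:
    nfe["count"] += 1
    return [1.0 - zi for zi in z]


rng = random.Random(0)
z_start = [rng.gauss(0.0, 1.0) for _ in range(4)]  # Z_{t_0} ~ N(0, I)
steps = [i / 8 for i in range(9)]                  # 8 uniform Euler steps on [0, 1]

z_plain = mg_euler_sample(toy_velocity, z_start, steps, alpha=0.0, beta=0.5)
z_mg = mg_euler_sample(toy_velocity, z_start, steps, alpha=0.5, beta=0.5)
print(nfe["count"])  # 16: two runs of 8 steps, one evaluation per step
```

Setting `alpha=0.0` reduces the loop to the ordinary Euler sampler, which makes it easy to check that the guidance term is the only change.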

Key Advantages

Zero Extra Cost: MG adds no model evaluations beyond those the underlying sampler already performs: one per step for vanilla sampling, or two per step when CFG is used (in that case MG treats the CFG-adjusted velocity as its input).
No Auxiliary Models: It does not require unconditional branches, weaker checkpoints, or additional networks.
Compatibility: It functions effectively both as a standalone method and when combined with CFG.
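
To illustrate the cost and compatibility claims, the sketch below feeds a CFG-adjusted velocity into the MG update and counts network evaluations. Everything here is an assumption made for illustration: the toy velocity field, the `"cat"` condition, the CFG convention (unconditional plus scaled difference), and the helper names.

```python
from typing import List, Optional, Tuple

Vector = List[float]
nfe = {"count": 0}


def toy_velocity(z: Vector, t: float, cond: Optional[str]) -> Vector:
    """Hypothetical stand-in for v_theta; cond=None plays the unconditional branch."""
    nfe["count"] += 1
    target = 1.0 if cond is not None else 0.0
    return [target - zi for zi in z]


def cfg_velocity(z: Vector, t: float, cond: str, scale: float) -> Vector:
    """CFG: two forward passes, then extrapolate away from the unconditional one."""
    v_c = toy_velocity(z, t, cond)
    v_u = toy_velocity(z, t, None)
    return [u + scale * (c - u) for c, u in zip(v_c, v_u)]


def mg_step(z: Vector, v: Vector, m: Vector, dt: float,
            alpha: float, beta: float) -> Tuple[Vector, Vector]:
    """One MG update: pure arithmetic on stored vectors, no model call."""
    z_next = [zi + dt * (vi + alpha * (vi - mi)) for zi, vi, mi in zip(z, v, m)]
    m_next = [(1.0 - beta) * vi + beta * mi for vi, mi in zip(v, m)]
    return z_next, m_next


def sample(z0: Vector, n_steps: int, cfg_scale: float,
           alpha: float, beta: float) -> Vector:
    z, m, dt = list(z0), None, 1.0 / n_steps
    for i in range(n_steps):
        v = cfg_velocity(z, i * dt, "cat", cfg_scale)  # CFG-adjusted velocity is MG's input
        m = list(v) if m is None else m
        z, m = mg_step(z, v, m, dt, alpha, beta)
    return z


nfe["count"] = 0
sample([0.0, 0.5], n_steps=8, cfg_scale=2.0, alpha=0.0, beta=0.5)  # CFG only
cfg_only_nfe = nfe["count"]

nfe["count"] = 0
sample([0.0, 0.5], n_steps=8, cfg_scale=2.0, alpha=0.5, beta=0.5)  # CFG + MG
cfg_mg_nfe = nfe["count"]

print(cfg_only_nfe, cfg_mg_nfe)  # 16 16
```

In this sketch the only state MG adds is the momentum vector `m`, which has the same shape as the latent.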

3. Key Contributions

Novel Guidance Mechanism: Introduced Momentum Guidance, which extracts guidance signals directly from the ODE trajectory history, eliminating the need for auxiliary models or double inference.
Theoretical Insight: Demonstrated that past velocities in the flow trajectory naturally provide the "smoother reference" needed for guidance, analogous to momentum in optimization but applied to generative sampling.
Plug-and-Play Implementation: The method is a simple modification to the Euler sampler update rule, making it immediately applicable to existing flow models (e.g., Rectified Flow, Diffusion Transformers).
Comprehensive Evaluation: Validated across diverse benchmarks (ImageNet-256, Stable Diffusion 3, FLUX.1-dev) and sampling budgets.
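
The momentum analogy can be made concrete by unrolling the EMA recursion from the Algorithm section. The following is derived here from the stated update rules (with $m_{t_0} = v_{t_0}$), not quoted from the paper:

$$m_{t_i} = \beta^{i} v_{t_0} + (1 - \beta) \sum_{k=0}^{i-1} \beta^{\,i-1-k} v_{t_k}, \qquad \beta^{i} + (1 - \beta) \sum_{k=0}^{i-1} \beta^{\,i-1-k} = 1$$

The weights are non-negative and sum to one, so $m_{t_i}$ is a weighted average of velocities from strictly earlier steps, with older steps discounted geometrically by $\beta$. For $\beta = 0$ it reduces to the previous velocity, $m_{t_i} = v_{t_{i-1}}$ for $i \geq 1$. The guided velocity can then be rewritten as

$$v_{t_i} + \alpha (v_{t_i} - m_{t_i}) = (1 + \alpha) v_{t_i} - \alpha m_{t_i}$$

which has the same extrapolation form as CFG, with the trajectory's own history playing the role of the smoother reference.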

4. Experimental Results

The authors evaluated MG on ImageNet-256, Stable Diffusion 3 (SD3), and FLUX.1-dev.

ImageNet-256 (Rectified Flow):
- Without CFG: MG achieved a 36.68% improvement in FID (Fréchet Inception Distance) on average compared to vanilla sampling.
- With CFG: MG achieved a 25.52% improvement in FID over standard CFG.
- Best Performance: At 64 sampling steps with CFG, MG attained an FID of 1.597.
- Efficiency: Standalone MG needs one network evaluation per step, whereas standard CFG needs two, so MG reaches similar quality at roughly half the number of function evaluations (NFEs).
- Diversity: Unlike standard CFG, which often degrades recall (diversity) as guidance strength increases, MG maintains or even improves recall while boosting precision.
Large-Scale Models (SD3 & FLUX.1-dev):
- MG consistently improved Human Preference Score (HPSv2.1) and ImageReward scores across various CFG scales.
- Qualitative results showed sharper details (e.g., clearer facial contours, intricate textures like coral or wings), reduced artifacts (blur, floating objects), and better geometric stability.
Ablation Studies:
- The method is robust across a wide range of hyperparameters ( $\alpha$ and $\beta$ ).
- Optimal performance is typically found with moderate $\alpha$ and small-to-medium $\beta$ .
- Combining MG with CFG interval scheduling (restricting guidance to specific time steps) yields further gains.
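
As a rough sketch of interval scheduling, the snippet below switches CFG on only inside a time window and counts the resulting network evaluations. The window endpoints, the scale, and the helper names are hypothetical; the summary does not state the values the authors used or how MG itself is scheduled.

```python
from typing import List


def interval_cfg_scale(t: float, scale: float, t_lo: float, t_hi: float) -> float:
    """Return `scale` inside [t_lo, t_hi]; otherwise 1.0, i.e. no guidance."""
    return scale if t_lo <= t <= t_hi else 1.0


def evals_per_step(cfg_scale: float) -> int:
    """With scale 1.0 the unconditional pass can be skipped (1 evaluation instead of 2)."""
    return 1 if cfg_scale == 1.0 else 2


timesteps: List[float] = [i / 8 for i in range(8)]  # start times of 8 uniform steps
scales = [interval_cfg_scale(t, scale=3.0, t_lo=0.25, t_hi=0.75) for t in timesteps]
total_evals = sum(evals_per_step(s) for s in scales)

print(scales)       # [1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 3.0, 1.0]
print(total_evals)  # 13, versus 16 with CFG on every step
```

This assumes the convention in which a CFG scale of 1.0 equals the plain conditional velocity. MG can consume whichever velocity each step produces, guided or not.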

5. Significance

Cost-Efficiency: MG eases the "fidelity vs. cost" trade-off. It provides sharpening benefits similar to guidance without CFG's extra forward pass, making high-quality generation accessible under constrained sampling budgets.
Scalability: Since it requires no additional training or auxiliary checkpoints, MG is immediately deployable on large flow models, proprietary or open-source, for which weaker reference checkpoints are unavailable.
Quality-Diversity Balance: It addresses the common issue where aggressive guidance sacrifices diversity. MG improves fidelity (precision) without the typical collapse in sample diversity (recall).
Practical Impact: By offering a simple, one-line modification to the sampling loop, MG provides a practical, scalable path to enhance the visual quality of the next generation of generative AI models.