Score-Regularized Joint Sampling with Importance Weights for Flow Matching

This paper proposes a score-regularized joint sampling framework with importance weighting that generates diverse, high-quality samples from flow matching models to enable accurate expectation estimation under limited sampling budgets.

Xinshuang Liu, Runfa Blark Li, Shaoxiu Wei, Truong Nguyen

Published 2026-03-02

Imagine you have a magical artist (the Flow Matching Model) who can draw anything you ask for, from a "sunny beach" to a "cyberpunk city." This artist is incredibly talented, but they have a habit: if you ask them to draw 10 pictures in a row, they tend to draw 10 almost identical pictures of the same sunny beach, just with slightly different clouds. They get stuck in their favorite "comfort zone."

This is a problem if you want to know the average of all the possible things the artist could draw. If you only ask for 10 pictures and they are all the same, your average will be wrong. You miss the rare but important things, like a "sunset beach" or a "stormy beach."

This paper proposes a new way to ask the artist for pictures so you get a diverse set of 10 unique images, while still being able to calculate the true average accurately.

Here is how they do it, broken down into three simple concepts:

1. The Problem: The "Groupthink" Artist

Normally, when you ask for 10 samples, the artist draws them one by one, independently. It's like asking 10 different people to guess the weather; if they all look out the same window, they might all guess "sunny," even if a storm is coming.

  • The Issue: If the artist has a rare but important style (like "stormy"), independent sampling might miss it entirely.
  • The Goal: We want the 10 samples to spread out and cover all the different styles (modes) the artist knows, not just the most popular one.
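The "groupthink" problem above fits in a few lines of code. This is a toy sketch, not anything from the paper: the two "styles," their probabilities, and the values attached to them are made-up stand-ins. It shows how a rare mode is easy to miss entirely with only 10 independent draws, which skews the estimated average.

```python
import random

# Toy "artist": draws a "sunny" scene (value 1.0) 90% of the time and a
# rare "stormy" scene (value -5.0) 10% of the time. All names and numbers
# here are illustrative, not taken from the paper.
def draw_picture(rng):
    return 1.0 if rng.random() < 0.9 else -5.0

rng = random.Random(0)
true_mean = 0.9 * 1.0 + 0.1 * (-5.0)   # = 0.4

# With a budget of only 10 independent draws, the rare "stormy" mode is
# often missed entirely, and the sample mean can land far from 0.4.
samples = [draw_picture(rng) for _ in range(10)]
estimate = sum(samples) / len(samples)
print(f"true mean = {true_mean:.2f}, 10-sample estimate = {estimate:.2f}")
```

Whenever the stormy mode fails to appear in the batch, the estimate is 1.0 instead of 0.4, which is exactly the "bad math" the paper sets out to fix.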

2. The Solution: The "Social Distancing" Rule (Score-Regularized Sampling)

To get diverse pictures, the researchers tell the artist: "Draw 10 pictures, but make sure they are different from each other." They do this by adding a "repulsive force" (like magnets pushing apart) that nudges the drawings away from each other as they are being created.

But here's the catch: If you push them too hard, the artist might get confused and draw nonsense (like a beach with a floating toaster). This is called "drifting off the map."

The Fix (Score-Regularization):
The researchers gave the artist a special compass called the "Score."

  • Think of the "Score" as a map that tells the artist where the "good, high-quality" areas are (the data manifold).
  • When the "Social Distancing" force tries to push a picture into a weird, low-quality area (off the map), the Score says, "No! Turn back! Stay on the path of good quality!"
  • The Result: The 10 pictures spread out to cover different styles (diversity), but they all stay within the realm of what looks realistic (quality). It's like herding cats: you want them to go in different directions, but you don't want them to jump off a cliff.
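The update described above can be sketched in one dimension. Everything here is a toy stand-in: the "velocity field" and "score" are hand-written functions for a two-mode target, and the step size, repulsion strength, and regularization weight are invented, whereas the paper uses a trained flow matching model and its learned score. The point is only the structure of the drift: model velocity, plus a repulsive force between the batch's samples, plus a score term that keeps every sample on the "map."

```python
# Toy 1-D sketch of a score-regularized joint sampling step.
# All functions and constants below are illustrative stand-ins.

MODES = [-2.0, 2.0]  # the "good, high-quality" regions in this toy world

def score(x):
    # Toy score: points back toward the nearest mode (the data manifold).
    nearest = min(MODES, key=lambda m: abs(x - m))
    return nearest - x

def velocity(x, t):
    # Placeholder drift; a real flow matching model would be a network.
    return score(x)

def repulsion(x, others, strength=0.5):
    # "Social distancing": push x away from the other samples in the batch.
    force = 0.0
    for y in others:
        d = x - y
        force += d / (d * d + 1e-3)  # softened so coincident points don't blow up
    return strength * force

def joint_step(xs, t, dt=0.05, reg=1.0):
    # One synchronous update of the whole batch:
    # drift = model velocity + repulsive force + score regularization.
    new_xs = []
    for i, x in enumerate(xs):
        others = xs[:i] + xs[i + 1:]
        drift = velocity(x, t) + repulsion(x, others) + reg * score(x)
        new_xs.append(x + dt * drift)
    return new_xs

xs = [0.1 * i for i in range(10)]   # 10 nearly identical starting points
for step in range(200):
    xs = joint_step(xs, t=step / 200)
print(sorted(round(x, 2) for x in xs))
```

Without the repulsion term, all 10 samples would collapse near a single mode; with it, the batch spreads across both modes, while the score term keeps every sample close to one of the high-quality regions instead of drifting into the empty space between them.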

3. The Secret Sauce: The "Fairness Ticket" (Importance Weights)

Now, we have 10 diverse pictures. But because we forced them to be different, they aren't a "fair" random sample anymore. For example, if the artist usually draws "sunny" 90% of the time, but our diversity rule forced one picture to be "stormy," that "stormy" picture is now over-represented in our group of 10.

If we just take the average of these 10 pictures, our math will be wrong. We need to fix the math.

The Solution:
The researchers developed a way to calculate a "Fairness Ticket" (an Importance Weight) for each picture.

  • The Analogy: Imagine you are at a party where, naturally, 99% of guests wear red shirts and only 1% wear blue. But tonight, to make the group diverse, your rule forced 1 guest in 10 to wear blue, so blue shirts now show up ten times more often than they naturally would.
  • To calculate the "average opinion" of the party correctly, you can't count every guest equally. Each guest gets a weight of (how often their shirt naturally appears) ÷ (how often your rule made it appear). The over-sampled blue shirt counts as only 0.01 / 0.10 = 0.1 of a person, while each now-under-sampled red shirt counts as 0.99 / 0.90 ≈ 1.1 people.
  • How they do it: They train a tiny, fast "helper robot" (a residual velocity field) that learns exactly how the diversity rule changed the odds. This robot calculates the weight for each picture as it's being drawn.
  • The Result: You get a diverse group of pictures, and when you average them using these special weights, the estimate converges to the true average, the same answer you would approach by drawing millions of ordinary random pictures.
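The weighting idea can be sketched with a simple discrete example. One big simplification to flag: the paper computes these weights along the sampling trajectory with a learned residual velocity field, whereas here both the true probabilities and the "diverse" sampler's probabilities are known numbers, so the ratio can be written down directly.

```python
import random

# Toy setup (illustrative numbers, not from the paper):
# the true model draws "sunny" (value 1.0) with prob 0.9 and
# "stormy" (-5.0) with prob 0.1, but our "diverse" sampler
# draws each style 50/50. The importance weight is the ratio
# w = p_true / p_diverse for each drawn sample.
P_TRUE = {"sunny": 0.9, "stormy": 0.1}
P_DIVERSE = {"sunny": 0.5, "stormy": 0.5}
VALUE = {"sunny": 1.0, "stormy": -5.0}

rng = random.Random(1)
styles = ["sunny" if rng.random() < P_DIVERSE["sunny"] else "stormy"
          for _ in range(100_000)]

weights = [P_TRUE[s] / P_DIVERSE[s] for s in styles]
values = [VALUE[s] for s in styles]

# Naive average of the diverse samples: biased toward the over-sampled mode.
naive_mean = sum(values) / len(values)

# Self-normalized importance-weighted average: the "fairness tickets"
# restore the true expectation.
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)

true_mean = sum(P_TRUE[s] * VALUE[s] for s in P_TRUE)  # 0.9*1.0 - 0.5 = 0.4
print(f"true={true_mean:.3f}  naive={naive_mean:.3f}  weighted={weighted_mean:.3f}")
```

The naive average lands near -2.0 because "stormy" is heavily over-represented in the diverse batch, while the weighted average recovers the true value of 0.4.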

Summary: Why This Matters

  • Old Way: Ask for 10 pictures. Get 10 similar ones. Miss the rare stuff. Bad math.
  • New Way: Ask for 10 pictures. Force them to be different (Diversity). Use a compass to keep them realistic (Score-Regularization). Give each picture a "weight" based on how rare it is (Importance Weights).
  • The Payoff: You get a much better understanding of what the AI can do, using fewer resources. It's like getting a full tour of a museum by visiting 10 different rooms, instead of staring at the same painting 10 times, and then doing the math correctly to know the "average" beauty of the whole museum.

This method helps AI researchers trust their models more, especially when they need to make decisions based on the "average" behavior of the AI, rather than just hoping for a lucky, random draw.
