Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances

The Big Problem: Moving Mountains is Expensive

Imagine you have two piles of sand, and you want to know exactly how much effort it would take to move one pile to look exactly like the other. In math and machine learning, this is called the Wasserstein Distance. It's a brilliant way to measure how different two groups of data are (like comparing two photos, two sets of medical scans, or two clouds of 3D points).

The Catch: Calculating this distance is like trying to move every single grain of sand one by one to find the perfect arrangement. It is incredibly accurate, but it is also painfully slow. If you have a lot of data, the computer has to do so many calculations that it might take hours or days. It's like trying to count every star in the sky to compare two galaxies.

The Current "Fast" Alternatives: The Cheap Approximations

To speed things up, scientists invented shortcuts called Sliced Wasserstein (SW) distances.

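The Analogy: Instead of moving the whole 3D pile of sand, imagine shining a flashlight through it from many different angles and looking only at the shadow each time. (Technically, each "shadow" is the data projected onto a single line, so it is one-dimensional.)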
The Benefit: Looking at the shadow is super fast.
The Problem: The shadow isn't the real object. From only a handful of angles, the shadows can look nearly identical even if the 3D objects are quite different. So, used on their own, these fast methods can be poor stand-ins for the true distance. They are like judging a book by its cover: they give you a hint, but not the whole story.

The Paper's Solution: The "Smart Translator"

The authors of this paper asked a clever question: "What if we could train a smart translator to look at the fast, cheap shadows and tell us exactly what the slow, expensive 3D distance would have been?"

They call this method RG (Regression on Sliced Wasserstein). Here is how it works, step-by-step:

1. The Training Phase (The "Study Session")

Imagine you have a student (the computer model) who needs to learn the relationship between "Shadows" (fast SW distances) and "Real Objects" (slow Wasserstein distances).

The teacher shows the student a few pairs of sand piles.
For each pair, the teacher calculates the Fast Shadow (easy) and the Real Distance (hard).
The student looks at the pattern: "Oh, when the shadow is X, the real distance is usually Y."
The student learns a simple formula (a linear equation) to predict the Real Distance just by looking at the Shadow.

The Magic: The student only needs to study a tiny number of examples (a "few-shot" approach). They don't need to memorize the whole library; they just need to understand the relationship.

2. The Prediction Phase (The "Speed Run")

Once the student has learned the formula, you can give them any new pair of sand piles.

You calculate the Fast Shadow (takes a split second).
You plug that number into the student's formula.
Boom! You get an estimate of the Real Distance that is almost as accurate as the slow method, but in a fraction of the time.

The Two Types of "Students" (Models)

The paper proposes two ways to train this student:

The Unconstrained Student: This student is free to guess any number. They look at the data and find the best mathematical fit. It's flexible but might sometimes guess a number that doesn't make physical sense (like a negative distance).
The Constrained Student: This student is given rules. They know that the "Shadow" is always smaller than the "Real Object" (or vice versa, depending on the type of shadow). By forcing the student to respect these rules, they learn faster and need fewer examples to get it right. This is like giving a student a hint: "The answer is always between 5 and 10."

Why Is This a Big Deal?

The authors tested this on real-world problems like:

3D Point Clouds: Comparing shapes of chairs, airplanes, and lamps (ShapeNet).
Medical Data: Comparing cells in the brain (MERFISH) or gene sequences (scRNA-seq).

The Results:

Speed: It is vastly faster than calculating the real distance.
Accuracy: It is much more accurate than the old "fast" methods.
Data Efficiency: It works great even when you don't have much data to train on.

The "Super-Powered" Upgrade: RG-Wormhole

The paper also introduces a hybrid tool called RG-Wormhole.

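Wormhole (short for Wasserstein Wormhole) is a deep learning model that learns to compress 3D shapes and other point clouds into compact codes whose distances mimic Wasserstein distances, and to rebuild the shapes from those codes. But it's slow to train because it keeps doing the expensive math over and over.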
RG-Wormhole replaces the expensive math with the "Smart Translator" formula.
The Result: You get the same high-quality 3D shapes, but the training happens much faster. It's like replacing a horse-drawn carriage with a sports car, but the car still drives on the same road.

Summary in One Sentence

This paper teaches a computer to guess the expensive, accurate answer by looking at cheap, fast approximations, allowing us to compare complex data sets far faster with only a small loss of accuracy.

1. Problem Statement

The Wasserstein distance (Optimal Transport distance) is a fundamental metric for comparing probability distributions, widely used in generative modeling, computational biology, and computer vision. However, computing the exact Wasserstein distance is computationally prohibitive for large-scale applications, typically requiring $O(n^3 \log n)$ time for discrete distributions of size $n$.
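To make the cost concrete, here is a minimal sketch of the exact distance for the simplest case: two uniform point clouds of the same size, where an optimal plan is a one-to-one matching. The function name `exact_wasserstein_bruteforce` is illustrative, and the brute-force search over all $n!$ matchings is only for exposition; practical solvers use linear programming or assignment algorithms instead.

```python
import itertools
import numpy as np

def exact_wasserstein_bruteforce(x, y, p=2):
    """Exact W_p between two uniform point clouds of equal size n.

    With uniform weights and equal sizes, an optimal transport plan is a
    permutation, so we can (very inefficiently) try every matching.
    """
    n = x.shape[0]
    # cost[i, j] = ||x_i - y_j||^p
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p
    rows = np.arange(n)
    best = min(cost[rows, list(perm)].sum()
               for perm in itertools.permutations(range(n)))
    return (best / n) ** (1.0 / p)

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = x + np.array([2.0, 0.0])   # the same triangle shifted by 2 along the first axis
print(exact_wasserstein_bruteforce(x, y))
```

For a pure translation the best matching pairs each point with its shifted copy, so the distance here equals the shift length, 2. Even at $n = 10$ the search already visits more than three million matchings, which illustrates why exact computation does not scale.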

While approximation methods exist (e.g., entropic regularization solved with Sinkhorn iterations), they remain expensive when applied to many pairs of distributions simultaneously (e.g., in dataset comparisons, nearest-neighbor classification, or training embeddings). Deep learning-based approaches like Wasserstein Wormhole learn embeddings to approximate these distances but suffer from high training costs, data hunger, and a restriction to empirical distributions.

The paper addresses the need for a fast, data-efficient, and accurate method to estimate Wasserstein distances for multiple pairs of distributions drawn from a meta-distribution, without relying on heavy neural network training.

2. Methodology

The authors propose a Regression Framework (RG) that learns a mapping from Sliced Wasserstein (SW) distances to the true Wasserstein distance.

Core Concept

Instead of learning complex embeddings, the method treats the Wasserstein distance as a response variable and various Sliced Wasserstein variants as predictor variables. The key insight is that SW distances are computationally cheap ($O(n \log n)$ per projection direction, since one-dimensional optimal transport reduces to sorting) and provide bounds or approximations of the true Wasserstein distance.
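As an illustration of why the predictors are cheap, the following sketch estimates the standard SW distance by Monte Carlo over random directions. It assumes two uniform point clouds of equal size; the function name, the default number of projections, and the seed are illustrative choices, not values from the paper.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=200, p=2, seed=0):
    """Monte Carlo estimate of SW_p between equal-size uniform point clouds."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random unit directions on the sphere.
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both clouds onto every direction and sort: shape (n_proj, n).
    xp = np.sort(theta @ x.T, axis=1)
    yp = np.sort(theta @ y.T, axis=1)
    # In 1D, W_p^p between equal-size samples is the mean |sorted difference|^p.
    per_direction = np.mean(np.abs(xp - yp) ** p, axis=1)
    return np.mean(per_direction) ** (1.0 / p)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 3))
y = x + np.array([1.0, -2.0, 0.5])
print(sliced_wasserstein(x, y))
```

The only non-trivial operation is a sort per direction, which is where the $O(n \log n)$ cost comes from.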

Predictors Used

The framework utilizes two categories of predictors to bound the true distance:

Lower Bounds:
- SW (Sliced Wasserstein): Standard projection-based distance.
- Max-SW: Maximizes the projection over directions.
- EBSW (Energy-based SW): Uses an energy-based distribution for projections.
Upper Bounds:
- PW (Projected Wasserstein): Lifted transportation plan using random projections.
- Min-SWGG (Minimum Sliced Wasserstein Generalized Geodesics): Minimizes the lifted cost.
- EST (Expected Sliced Transport): Energy-based lifted distance.
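The lower/upper split can be seen in a few lines. In the sketch below, each random direction yields a matching by sorting the projections; measuring that matching's cost on the projections gives a lower bound on $W_2$, while measuring the same matching's cost in the original space gives an upper bound, because it is a valid (if suboptimal) transport plan. Taking the maximum and minimum over sampled directions is a crude stand-in in the spirit of Max-SW and Min-SWGG, not the optimization procedures those methods actually use; all names are illustrative.

```python
import numpy as np

def sliced_bounds(x, y, n_proj=100, seed=0):
    """Return (lower, upper) bounds on W_2 for equal-size uniform point clouds."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    lower_sq, upper_sq = [], []
    for t in theta:
        ix, iy = np.argsort(x @ t), np.argsort(y @ t)
        # Lower bound: cost of the sorted matching measured on the 1D projections.
        lower_sq.append(np.mean((x[ix] @ t - y[iy] @ t) ** 2))
        # Upper bound: cost of the same matching measured in the original space.
        upper_sq.append(np.mean(np.sum((x[ix] - y[iy]) ** 2, axis=1)))
    lower = np.sqrt(np.max(lower_sq))   # tightest lower bound among sampled directions
    upper = np.sqrt(np.min(upper_sq))   # cheapest lifted plan among sampled directions
    return lower, upper

rng = np.random.default_rng(2)
x = rng.normal(size=(60, 3))
shift = np.array([1.0, 1.0, 0.0])
lower, upper = sliced_bounds(x, x + shift)
print(lower, np.linalg.norm(shift), upper)   # true W_2 of a translation is the shift length
```

The true distance is sandwiched between the two numbers, which is what makes a weighted combination of them a natural estimator.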

Regression Models

The paper introduces two linear regression models to predict $W_p(\mu, \nu)$:

Unconstrained Linear Model:
$W_p(\mu, \nu) = \sum_{k=1}^K \omega_k S^{(k)}_p(\mu, \nu) + \epsilon$
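Here $S^{(k)}_p$ is the $k$-th sliced predictor, $\omega_k$ its weight, and $\epsilon$ a residual error term. This model admits a closed-form least-squares solution. It is flexible but requires estimating $K$ parameters.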
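A minimal sketch of the unconstrained fit, using synthetic numbers in place of real SW predictors. The matrix `S` (one row per training pair, one column per predictor), the weights `true_omega`, and the noise level are all made up for illustration; `np.linalg.lstsq` returns the same solution as the closed form $(S^\top S)^{-1} S^\top w$ when $S$ has full column rank.

```python
import numpy as np

def fit_unconstrained(S, w):
    """Least-squares weights for w ≈ S @ omega (no intercept, matching the model above)."""
    omega, *_ = np.linalg.lstsq(S, w, rcond=None)
    return omega

rng = np.random.default_rng(0)
M, K = 30, 3                                   # training pairs, number of predictors
S = rng.uniform(0.5, 2.0, size=(M, K))         # synthetic stand-ins for sliced predictors
true_omega = np.array([0.2, 0.5, 0.3])         # hypothetical ground-truth weights
w = S @ true_omega + 0.01 * rng.normal(size=M) # synthetic "exact" distances with noise

omega_hat = fit_unconstrained(S, w)
w_pred = S @ omega_hat                         # prediction is a single matrix-vector product
print(omega_hat)
```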
Constrained Linear Model:
This model leverages the known theoretical bounds (lower bound $S_L$ and upper bound $S_U$). For a single pair of bounds, it takes the form:
$W_p(\mu, \nu) = \omega S_L(\mu, \nu) + (1-\omega) S_U(\mu, \nu) + \epsilon, \quad \text{where } 0 \le \omega \le 1$
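Because each lower/upper pair shares a single weight $\omega$, this halves the number of parameters (effectively $K/2$ instead of $K$) and introduces an inductive bias: the prediction is forced to lie between the two bounds. This makes the model more robust in low-data regimes (few-shot learning).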
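For a single pair of bounds the constrained fit is a one-dimensional problem, so it can be sketched directly. Rewriting the model as $W_p = S_U - \omega (S_U - S_L) + \epsilon$ and minimizing the squared error gives $\omega = \sum (S_U - W_p)(S_U - S_L) / \sum (S_U - S_L)^2$; since the objective is a convex quadratic in $\omega$, clipping to $[0, 1]$ gives the constrained minimizer. This is a derivation under those assumptions rather than a transcription of the paper's solution, and the function names are illustrative.

```python
import numpy as np

def fit_constrained(s_lower, s_upper, w):
    """Least-squares omega in [0, 1] for w ≈ omega * s_lower + (1 - omega) * s_upper."""
    gap = s_upper - s_lower
    denom = np.sum(gap ** 2)
    if denom == 0.0:
        return 0.5   # bounds coincide, so every omega gives the same prediction
    omega = np.sum((s_upper - w) * gap) / denom
    return float(np.clip(omega, 0.0, 1.0))

def predict_constrained(omega, s_lower, s_upper):
    """Convex combination of the bounds, so the estimate always lies between them."""
    return omega * s_lower + (1.0 - omega) * s_upper

# Synthetic example: the true distance sits 30% of the way down from the upper bound.
s_lower = np.array([1.0, 2.0, 3.0])
s_upper = np.array([2.0, 4.0, 5.0])
w = 0.3 * s_lower + 0.7 * s_upper
omega = fit_constrained(s_lower, s_upper, w)
print(omega, predict_constrained(omega, s_lower, s_upper))
```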

Training and Inference

Training: The model is trained on a small subset of $M$ pairs out of the $N$ pairs of interest ($M \ll N$), for which the true Wasserstein distances are computed exactly (the expensive step). The regression coefficients are learned via least squares.
Inference: For any new pair of distributions, the method computes the cheap SW distances and applies the learned linear combination to predict the Wasserstein distance. The computational complexity matches that of computing SW distances ($O(n \log n)$).
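The train-then-predict workflow can be sketched end to end on a toy meta-distribution where everything is available in closed form: Gaussians with diagonal covariance, for which $W_2^2 = \|m_1 - m_2\|^2 + \|\sigma_1 - \sigma_2\|^2$ and every 1D projection is again Gaussian. The setup (dimension, sample sizes, the two predictors used) is an assumption chosen for illustration and is not one of the paper's experiments. Exact distances are computed for all pairs here only so the held-out error can be reported; in practice they would be needed for the $M$ training pairs alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, M, L = 5, 500, 20, 64   # dimension, total pairs, training pairs, projections

def random_gaussian():
    """Draw (mean, per-axis std) of a diagonal Gaussian from a toy meta-distribution."""
    return rng.normal(size=d), rng.uniform(0.5, 2.0, size=d)

def w2_exact(m1, s1, m2, s2):
    # Closed-form W_2 between Gaussians with diagonal covariances.
    return np.sqrt(np.sum((m1 - m2) ** 2) + np.sum((s1 - s2) ** 2))

def sliced_features(m1, s1, m2, s2, theta):
    # Projecting onto each direction gives 1D Gaussians, so each 1D W_2 is closed-form.
    dm = theta @ (m1 - m2)
    ds = np.sqrt((theta ** 2) @ s1 ** 2) - np.sqrt((theta ** 2) @ s2 ** 2)
    w1d_sq = dm ** 2 + ds ** 2
    # Two lower-bound predictors: SW (average) and a sampled Max-SW (maximum).
    return np.array([np.sqrt(w1d_sq.mean()), np.sqrt(w1d_sq.max())])

theta = rng.normal(size=(L, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)

pairs = [random_gaussian() + random_gaussian() for _ in range(N)]
S = np.array([sliced_features(*pr, theta) for pr in pairs])   # cheap, for all N pairs
w_true = np.array([w2_exact(*pr) for pr in pairs])            # used for training + evaluation

# Training: fit the weights on M pairs only.
train = rng.choice(N, size=M, replace=False)
omega, *_ = np.linalg.lstsq(S[train], w_true[train], rcond=None)

# Inference: apply the learned combination to every pair.
w_pred = S @ omega
held_out = np.ones(N, dtype=bool)
held_out[train] = False
rel_err = np.mean(np.abs(w_pred[held_out] - w_true[held_out]) / w_true[held_out])
print(f"omega = {omega}, mean relative error on held-out pairs = {rel_err:.3f}")
```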

RG-Wormhole

The authors also propose RG-Wormhole, a hybrid approach. They replace the expensive exact Wasserstein distance calculations within the Wasserstein Wormhole training loop (both encoder pairwise losses and decoder reconstruction losses) with the fast RG estimates. This preserves the embedding quality of Wormhole while drastically reducing training time.
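The substitution can be sketched for the encoder side. The snippet below assumes the pairwise loss penalizes the squared gap between Euclidean distances of the embeddings and target distances, which is a simplified reading of the Wormhole objective rather than its exact form; the encoder itself is left out, and `pairwise_embedding_loss`, `rg_target_matrix`, and the array shapes are hypothetical names and conventions introduced here.

```python
import numpy as np

def rg_target_matrix(S, omega):
    """RG estimates for every pair in a batch.

    S: (B, B, K) array of cheap sliced predictors for each pair of point clouds.
    omega: (K,) weights learned beforehand by the regression step.
    """
    return S @ omega

def pairwise_embedding_loss(z, target_dist):
    """Mean squared gap between embedding distances and target distances.

    z: (B, k) embeddings of a batch of B point clouds (from any encoder).
    target_dist: (B, B) target distances; exact OT in the original setup,
                 RG estimates in the accelerated variant.
    """
    emb_dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    iu = np.triu_indices(z.shape[0], k=1)   # count each unordered pair once
    return np.mean((emb_dist[iu] - target_dist[iu]) ** 2)

rng = np.random.default_rng(0)
B, k, K = 4, 3, 2
z = rng.normal(size=(B, k))                 # stand-in for encoder outputs
S = rng.uniform(0.5, 2.0, size=(B, B, K))   # stand-in for sliced predictors
omega = np.array([0.4, 0.6])                # stand-in for learned weights
print(pairwise_embedding_loss(z, rg_target_matrix(S, omega)))
```

The point of the design is that the target matrix costs only as much as the sliced predictors, so the per-batch cost no longer includes an exact OT solve for each pair.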

3. Key Contributions

First Regression Framework: Introduces a novel framework where Wasserstein distance is regressed onto various SW distances (both lower and upper bounds) under a meta-distribution setting.
Efficient Linear Models: Proposes both unconstrained (closed-form solution) and constrained (parameter-reduced, bias-enhanced) linear models. These models are parsimonious and require only a few training pairs to learn.
RG-Wormhole: Demonstrates that replacing exact OT calculations in deep learning architectures with RG estimates yields a model (RG-Wormhole) that matches the performance of the original Wormhole but trains significantly faster.
Theoretical and Empirical Validation: Provides proofs for the closed-form solutions and extensive empirical validation across diverse datasets and dimensions.

4. Experimental Results

The method was evaluated on four datasets of increasing dimensionality: MNIST Point Clouds (2D), ShapeNetV2 (3D), MERFISH Cell Niches (254D), and scRNA-seq (2,500D).

Accuracy vs. State-of-the-Art:
- In low-data regimes (e.g., 10–100 training pairs), RG variants consistently outperform Wasserstein Wormhole and classical methods (Sinkhorn, Linear OT).
- On ShapeNetV2, RG-seo (using all 6 predictors) achieved 83.5% k-NN accuracy, nearly matching the exact Wasserstein distance (84.2%) and significantly outperforming single SW metrics (~72%).
- $R^2$ scores for RG variants were consistently high (often >0.9), whereas Wormhole struggled with low $R^2$ in small-data settings.
Speed and Efficiency:
- Inference: RG methods are orders of magnitude faster than exact Wasserstein and Sinkhorn. For 19,900 pairs, RG-seo took ~94 seconds compared to ~5,000 seconds for exact Wasserstein.
- Training (RG-Wormhole): RG-Wormhole reduced training time by a large margin compared to standard Wormhole, with the gap widening as batch size increases, while maintaining essentially the same reconstruction quality, barycenter visualization, and interpolation capabilities.
Robustness: The method generalized well in intra-class and inter-class settings, even when trained on restricted subsets of data.
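For reference, the $R^2$ score quoted above is the usual coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares of the exact distances. A minimal sketch, with illustrative names and toy numbers:

```python
import numpy as np

def r2_score(w_true, w_pred):
    """Coefficient of determination between exact and predicted distances."""
    ss_res = np.sum((w_true - w_pred) ** 2)            # unexplained variation
    ss_tot = np.sum((w_true - np.mean(w_true)) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

w_true = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_score(w_true, w_true))                        # perfect prediction
print(r2_score(w_true, np.full(4, w_true.mean())))     # predicting the mean everywhere
```

A score of 1 means the predictions reproduce the exact distances, while a score of 0 means they do no better than always predicting the average distance.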

5. Significance

This paper offers a paradigm shift in approximating Optimal Transport:

Data Efficiency: It solves the "data hunger" problem of deep learning-based OT approximations by leveraging simple linear regression on cheap geometric proxies.
Scalability: It enables the use of Wasserstein distances in large-scale, real-time applications where exact computation is impractical and deep learning training is too slow.
Versatility: The framework is agnostic to the underlying distribution type (continuous or discrete) and works effectively across low to ultra-high dimensions.
Practical Impact: The introduction of RG-Wormhole provides an immediate, drop-in acceleration for existing OT-based deep learning pipelines, making high-fidelity distributional learning accessible with limited computational resources.