Imagine you are a detective trying to figure out why two groups of people are so different from each other. Maybe one group is from New York and the other is from Tokyo. You know they are different, but how are they different? Is it the food? The weather? The way they dress?
In the world of data science, this is called measuring the distance between two distributions (groups of data). A popular tool for this is called the Wasserstein Distance. Think of it as a "moving cost." If you have a pile of sand in New York and want to move it to look like a pile of sand in Tokyo, the Wasserstein distance tells you the minimum amount of work (or fuel) it would take to move every grain of sand to its new spot.
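For the simplest special case (two equal-size samples of single numbers), the optimal "sand-moving" plan is easy to write down yourself: match the sorted values of one pile to the sorted values of the other and average the travel distances. The sketch below is just that toy case, not the general weighted, multi-dimensional Wasserstein distance:

```python
def wasserstein_1d(xs, ys):
    # Toy 1D case with equal-size samples: the optimal plan matches the
    # sorted values of one "pile" to the sorted values of the other.
    # The distance is then the average moving cost per grain of sand.
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Every grain in the first pile must travel 5 units to match the second.
print(wasserstein_1d([0, 1, 3], [5, 6, 8]))  # 5.0
```

Real implementations (e.g. in optimal-transport libraries) solve the same matching problem for weighted points in many dimensions, but the "minimum total moving cost" idea is exactly the one above.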
The Problem:
Usually, when scientists calculate this "moving cost," they get a single number (e.g., "It costs 50 units of energy"). Or, they get a complex map showing exactly which grain of sand moved where.
- The Number: Tells you how much things changed, but not what changed.
- The Map: Shows the movement, but it's often too messy to read. It's like looking at a traffic jam from a helicopter; you see cars moving, but you can't tell if the accident was caused by a broken light, a bad driver, or a spilled coffee.
The Solution: WaX (Wasserstein Distances Made Explainable)
The authors of this paper, Philip, Jacob, and Grégoire, created a new tool called WaX.
Think of WaX as a high-tech magnifying glass or a spotlight that you can shine on your data. Instead of just giving you the total "moving cost," WaX breaks it down and says:
- "Hey, 40% of the cost is because the New Yorkers are taller."
- "Another 30% is because the Tokyo group eats more rice."
- "And 10% is because of a specific outlier (one very strange person) who is moving a huge distance."
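To see why such a per-feature breakdown is even possible, note that for a squared moving cost, the total cost of a given matching splits exactly into one non-negative share per feature. The sketch below illustrates only this additivity with hypothetical data (the feature names and numbers are invented); it is not the authors' actual WaX method, which derives its shares via relevance propagation:

```python
def per_feature_cost(pairs):
    # pairs: list of (source_point, target_point) from some fixed matching.
    # With a squared Euclidean cost, the total moving cost decomposes
    # exactly into one non-negative share per feature (coordinate).
    n_features = len(pairs[0][0])
    shares = [0.0] * n_features
    for src, dst in pairs:
        for j in range(n_features):
            shares[j] += (src[j] - dst[j]) ** 2
    return shares

# Hypothetical matching over two features: (height_cm, rice_meals_per_day)
pairs = [((170, 2), (180, 5)), ((165, 1), (178, 4))]
print(per_feature_cost(pairs))  # [269.0, 18.0] -> height dominates the cost
```

Summing the shares recovers the total cost, so each feature's share can be read as "this fraction of the difference is due to this feature."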
How Does It Work? (The Creative Analogy)
Imagine the Wasserstein distance calculation is a giant, complex machine made of gears and levers.
- The Old Way: You press a button, the machine whirs, and a lightbulb turns on showing the total energy used. You have no idea which gear caused the most friction.
- The WaX Way: The authors realized they could rewire this machine to look like a neural network (a type of AI brain). Once it looks like a brain, they can use a technique called Layer-wise Relevance Propagation (LRP).
- Imagine the lightbulb (the final answer) is glowing bright red.
- WaX works backward, tracing the red glow back through the wires.
- It asks: "Which wire carried the most red light?"
- It keeps going back until it reaches the very first inputs (the data points or features).
- Suddenly, you see exactly which specific features (like "height" or "rice consumption") are glowing the brightest. Those are the culprits causing the difference.
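The "tracing the glow backward" step can be sketched with the basic LRP redistribution rule on a single hypothetical linear layer: each output neuron's relevance is split among its inputs in proportion to how much each input contributed. This is a generic LRP illustration, not the paper's neuralized Wasserstein network; all numbers below are made up:

```python
def lrp_layer(acts, weights, rel_out):
    # One backward LRP step: output k's relevance rel_out[k] is divided
    # among the inputs j in proportion to their contributions acts[j] * w[j][k].
    n_in, n_out = len(acts), len(rel_out)
    rel_in = [0.0] * n_in
    for k in range(n_out):
        z = [acts[j] * weights[j][k] for j in range(n_in)]
        total = sum(z) or 1e-9  # guard against division by zero
        for j in range(n_in):
            rel_in[j] += z[j] / total * rel_out[k]
    return rel_in

acts = [1.0, 2.0, 0.5]                     # input activations
weights = [[0.5, 1.0], [1.0, 0.0], [0.0, 2.0]]
rel_out = [3.0, 1.0]                       # "glow" at the two output wires
rel_in = lrp_layer(acts, weights, rel_out)
print(rel_in, sum(rel_in))                 # [1.1, 2.4, 0.5] -- sum stays 4.0
```

The key property is conservation: the input relevances add up to the output relevance, so the total "moving cost" is fully accounted for by the features it gets traced back to.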
Why Is This Cool? (Real-World Examples)
The paper shows three ways this "spotlight" helps us:
1. Fixing Biased AI (The "Domain Adaptation" Use Case)
Imagine you train a robot to recognize cats using photos from a sunny beach. Then you try to use it in a snowy forest, and it fails. Why? Because the robot learned that "sand" means "cat."
- WaX's Role: It shines a light on the features causing the robot to fail. It says, "Stop looking at the sand! Look at the ears!" By identifying and removing the "sandy" features, the robot becomes smarter and works in the snow too.
2. Understanding Aging (The "Abalone" Use Case)
Imagine you have a group of sea snails (abalone). You look at them when they are young, and then again a year later. They have grown.
- WaX's Role: It doesn't just say "they grew." It breaks the growth down into subgroups. It might reveal that the small snails grew mostly in length, while the large snails grew mostly in weight. It untangles the complex process of aging into simple, understandable stories.
3. Spotting Dataset Differences (The "Face" Use Case)
Imagine you have two huge photo albums of famous people: one from Instagram (CelebA) and one from a news site (LFW).
- WaX's Role: It scans the albums and finds the hidden differences. It might say, "The Instagram album has way more photos of women wearing sunglasses, while the news album has more photos of men in suits." This helps data scientists know if their training data is biased before they build an AI.
The Bottom Line
Before this paper, comparing two groups of data was like looking at a blurry photo of a car crash and just saying, "That was a bad crash."
With WaX, we can now look at the crash and say, "The crash happened because the driver was texting, the road was wet, and the brakes were old." It turns a mysterious number into a clear, actionable story, helping us understand why our data is shifting and how to fix it.