Original authors: Sai-Aakash Ramesh, Archit Sood, Andrew Corbett, Tim Dodwell

Published 2026-05-28 · Author reviewed

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you have a massive, messy library of books. Some books are about cooking, some about space, and some about history. Your goal is to create a small, manageable "highlight reel" of this library that captures the essence of the collection so you can find what you need quickly.

This paper introduces a new method called Supervised Distributional Reduction (SDR) to solve a specific problem with how we usually summarize data.

The Problem: The "Blind" Summarizer

Traditionally, when computers try to summarize a huge dataset (a process called "dimensionality reduction" or "clustering"), they act like a blind librarian. They look at the physical shape of the books—how thick they are, how heavy they are, or how close they sit on the shelf. They group similar-looking books together.

However, this blind approach has a flaw: it might group a book about "cooking pasta" with a book about "pasta shapes in physics" just because they both have the word "pasta" in the title, even though a human looking for a recipe would want them separated. The computer preserves the geometry (the shape of the data) but ignores the meaning (the labels or targets we care about).

The Solution: SDR (The "Smart" Summarizer)

The authors propose SDR, a method that acts like a librarian who has read the back covers. It doesn't just look at how books sit on the shelf; it actively checks the content to ensure the summary helps you find what you are actually looking for.

They achieve this by combining two powerful ideas:

Optimal Transport (The "Moving Trucks"): Imagine you need to move all the books from a giant warehouse to a few representative "shelves." Optimal Transport is the math that figures out the most efficient way to move the books so that the relationships between them stay the same. If two books were neighbors in the warehouse, they should remain neighbors on the new shelf.
Dependence Maximization (The "Relevance Check"): This is the new "secret sauce." The authors realized that just moving books efficiently isn't enough. You also need to make sure the books on the new shelf are actually relevant to the questions you're asking. They added a specific "relevance check" (using a metric called CKA) that forces the computer to align the summary directly with the answers (labels) you care about.

How It Works (The "Two-Step Dance")

The algorithm does a "two-step dance" to create the perfect summary:

Step 1: The Geometry Step. It uses the "Moving Trucks" math to arrange the data points so they keep their natural shape and structure.
Step 2: The Relevance Step. It adds a "Relevance Check" that pulls the arrangement toward the correct answers.

The paper argues that previous methods tried to do this by letting the "Moving Trucks" figure out the relevance indirectly. The authors found this was too weak—the trucks would get distracted by the shape of the books and forget the content. By adding the direct "Relevance Check," SDR ensures the summary is both structurally sound and highly useful for prediction.

The Bonus Feature: A "Magic Map" for New Data

Usually, when you summarize a dataset, you can't easily apply that summary to a new book that wasn't in the original library. You'd have to start over.

SDR solves this by creating a "Magic Map" (a mathematical projection). Once the summary is built, this map allows you to instantly place any new, unseen book onto the correct spot in the summary without re-doing the whole process.

Why This Matters for "Gaussian Processes"

The paper specifically highlights how this helps Gaussian Processes (GPs). You can think of a GP as a very smart predictor that guesses what will happen next based on past data.

Standard GPs are like a flat map: they assume the rules of the world are the same everywhere (e.g., "gravity is always 9.8 m/s²").
SDR helps create a 3D topographical map: it realizes that the rules might change depending on where you are. If the data is about cooking, the rules change in the kitchen vs. the garden.

By using SDR, the GP can build a "smart map" that adapts to the local shape of the data and the specific goals you have, making it much better at predicting outcomes in complex situations.

Summary

In short, the paper says: "Don't just summarize data by how it looks; summarize it by what it means." They built a tool (SDR) that uses advanced math to create compact, smart summaries of data that preserve the original structure while explicitly focusing on the answers you need, and they showed it works better than previous methods for making predictions.

Technical Summary: Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

1. Problem Statement

The paper addresses the challenge of learning data representations that simultaneously capture intrinsic data geometry and target-relevant structure. While Distributional Reduction (DistR) offers a principled framework for unifying clustering and dimensionality reduction by learning a low-dimensional set of representative points via Optimal Transport (OT), existing methods are largely unsupervised. This limitation leads to representations that may fail to retain task-relevant information and lack a clear mechanism for out-of-sample generalization, rendering them less effective for downstream prediction tasks.

The authors identify a specific "supervision bottleneck" in extending OT-based methods to supervised settings: relying solely on the coupling matrix to mediate supervision (as in Fused Gromov-Wasserstein) often results in weak gradients for representation updates, causing the supervision signal to be diluted by structural constraints.
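The bottleneck can be written out in a few lines. The sketch below uses the notation from the objective in Section 2.1 (coupling $T$, embeddings $Z$, supervised loss $L_s$, prototype targets $g^*_j$); the shorthand $S(T)$ for the supervised term is introduced here for illustration and is not notation from the paper.

$$
S(T) = \sum_{i,j} L_s\big(y_i, g^*_j(T)\big)\, T_{ij}, \qquad
\left.\frac{\partial S}{\partial Z}\right|_{T \text{ fixed}} = 0, \qquad
\frac{\mathrm{d}}{\mathrm{d}Z}\, S\big(T^*(Z)\big) = \frac{\partial S}{\partial T}\,\frac{\partial T^*}{\partial Z}.
$$

Because $S$ has no explicit $Z$ argument, the only route from the labels to $Z$ runs through $\partial T^* / \partial Z$. If that sensitivity is small, the supervised gradient is small as well, whatever the size of $\partial S / \partial T$.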

2. Methodology

2.1 Supervised Distributional Reduction (SDR)

The core contribution is SDR, an algorithm that learns target-aware representations by combining Optimal Transport with explicit dependence maximization.

Base Framework: SDR builds upon the Fused Gromov-Wasserstein (FGW) objective, which aligns the relational structure of the input distribution with a set of representative points (prototypes).
The Supervision Bottleneck: The authors demonstrate that in a standard FGW formulation, the supervised term depends on the coupling matrix $T$ but not directly on the embeddings $Z$. Consequently, when $T$ is fixed, the gradient of the supervised loss with respect to $Z$ is zero. Even in joint optimization, the supervision signal reaching $Z$ is attenuated if the optimal coupling $T^*(Z)$ is locally insensitive to $Z$.
Direct Dependence Maximization: To overcome this, SDR augments the objective with a direct dependence term based on Centered Kernel Alignment (CKA). The joint objective function $J_{SDR}$ is defined as:
$J_{SDR}(Z, T, h_Z) = (1-\alpha) \sum_{i,j} L_s(y_i, g^*_j(T))T_{ij} + \alpha \text{GW}(Z; T) - \eta \text{CKA}(Z, \tilde{Y})$
Here, the first term is the Barycentric Supervised FGW (BS-FGW) loss (where prototype targets $g^*_j$ are analytically eliminated via Bregman barycentric properties), the second is the geometric Gromov-Wasserstein loss, and the third is the negative CKA term, so minimizing $J_{SDR}$ maximizes the dependence between embeddings $Z$ and projected targets $\tilde{Y}$.
Optimization: The problem is solved via an inexact block coordinate descent scheme:
- T-step: Optimizes the semi-relaxed BS-FGW objective (ignoring CKA) to update the coupling matrix $T$.
- Z-step: Optimizes the GW and CKA terms using stochastic gradient-based optimization (e.g., Adam) to update the embeddings $Z$.
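To make the two $Z$-dependent terms concrete, here is a minimal NumPy sketch that evaluates $\alpha\,\text{GW}(Z; T) - \eta\,\text{CKA}(Z, \tilde{Y})$ for a fixed coupling. It is an illustration under stated assumptions, not the authors' implementation: it assumes a square-loss GW cost on squared Euclidean distance matrices, a linear-kernel CKA, and that $\tilde{Y}$ is the coupling-weighted average of the targets at each prototype. All function names are hypothetical.

```python
import numpy as np

def sq_dists(A):
    """Pairwise squared Euclidean distances between the rows of A."""
    sq = np.sum(A**2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def gw_square_loss(CX, CZ, T):
    """Gromov-Wasserstein cost with square loss for a fixed coupling T:
    sum_{i,j,k,l} (CX[i,k] - CZ[j,l])**2 * T[i,j] * T[k,l]."""
    p, q = T.sum(axis=1), T.sum(axis=0)      # marginals of the coupling
    cross = np.sum((CX @ T @ CZ.T) * T)      # sum of CX[i,k] CZ[j,l] T[i,j] T[k,l]
    return p @ (CX**2) @ p + q @ (CZ**2) @ q - 2.0 * cross

def linear_cka(A, B):
    """Linear-kernel CKA between row-aligned matrices A and B."""
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc, Lc = H @ (A @ A.T) @ H, H @ (B @ B.T) @ H
    return np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc))

def z_step_objective(Z, CX, T, Y, alpha=0.5, eta=1.0):
    """alpha * GW(Z; T) - eta * CKA(Z, Y_tilde) for a fixed coupling T."""
    # ASSUMPTION: Y_tilde is the coupling-weighted mean target per prototype.
    Y_tilde = (T.T @ Y) / T.sum(axis=0)[:, None]
    return alpha * gw_square_loss(CX, sq_dists(Z), T) - eta * linear_cka(Z, Y_tilde)

# Toy data: n input points, m prototypes, random coupling with uniform row marginal.
rng = np.random.default_rng(0)
n, m = 40, 8
X = rng.normal(size=(n, 5))
Y = rng.normal(size=(n, 2))
Z = rng.normal(size=(m, 2))
T = rng.random((n, m))
T /= T.sum(axis=1, keepdims=True) * n        # each row sums to 1/n
CX = sq_dists(X)
value = z_step_objective(Z, CX, T, Y)
print(value)
```

In a Z-step, a function like `z_step_objective` would be the quantity handed to a gradient-based optimizer with $T$ held fixed. The CKA term depends on $Z$ directly, which is what gives the labels a gradient path that does not pass through the coupling.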

2.2 Out-of-Sample Extension via RKHS Projection

To enable the use of SDR in predictive pipelines where unseen data must be mapped to the learned embedding space, the authors formulate a mapping estimation problem. They enforce that the learned embeddings $Z$ lie close to the image of a function in a Reproducing Kernel Hilbert Space (RKHS).

They add a projection-consistency term to the objective, leading to an SDR-OOS formulation.
The mapping $L$ is obtained by solving a regularized kernel ridge regression problem, which yields a stable projection operator $z(x^*) = K(x^*, X)L$ for unseen points $x^*$.
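The projection operator $z(x^*) = K(x^*, X)L$ can be sketched with a standard kernel ridge regression solve. This is a minimal sketch, not the paper's SDR-OOS procedure: it assumes an RBF kernel, takes the embeddings $Z$ as already learned, and uses a hypothetical `ridge` parameter whose relation to the paper's regularization weights is not specified here.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_projection(X, Z, gamma=1.0, ridge=1e-3):
    """Kernel ridge regression from inputs X to embeddings Z: L = (K + ridge*I)^{-1} Z."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + ridge * np.eye(len(X)), Z)

def project(X_new, X, L, gamma=1.0):
    """Out-of-sample embedding z(x*) = K(x*, X) L."""
    return rbf_kernel(X_new, X, gamma) @ L

# Toy stand-ins for training inputs and their learned embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Z = rng.normal(size=(30, 2))

L = fit_projection(X, Z, ridge=1e-8)         # tiny ridge: near-interpolation
Z_hat = project(X, X, L)                     # re-embed the training points
Z_new = project(rng.normal(size=(5, 5)), X, L)
print(Z_new.shape)
```

With a very small ridge the map nearly interpolates the training embeddings. A larger ridge trades that fidelity for a smoother map, which is the usual regularization trade-off in kernel ridge regression.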

2.3 Application to Non-Stationary Kernel Construction

The learned SDR embeddings induce a data-dependent, non-stationary geometry. This allows for the construction of adaptive kernels for Gaussian Processes (GPs). By applying a stationary kernel (e.g., RBF) in the SDR embedding space, the induced kernel in the original input space becomes non-stationary and responsive to local variations in both data geometry and supervision. This approach decouples representation learning from GP training, offering a non-parametric alternative to Deep Kernel Learning (DKL).
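A small sketch shows why a stationary kernel applied after a nonlinear embedding is non-stationary in the input space. The embedding `toy_embed` below is a hypothetical stand-in (an elementwise `tanh`), not an SDR projection; any learned map $z(\cdot)$ would take its place.

```python
import numpy as np

def rbf(ZA, ZB, gamma=1.0):
    """Stationary RBF kernel evaluated on embedded points."""
    sq = np.sum(ZA**2, axis=1)[:, None] + np.sum(ZB**2, axis=1)[None, :] - 2.0 * ZA @ ZB.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def induced_kernel(XA, XB, embed, gamma=1.0):
    """k(x, x') = rbf(embed(x), embed(x')): stationary in z, not in x."""
    return rbf(embed(XA), embed(XB), gamma)

def toy_embed(X):
    """Hypothetical stand-in for a learned embedding map z(x)."""
    return np.tanh(X)

# Two pairs of inputs with the same displacement (0.5) at different locations.
k_near_origin = induced_kernel(np.array([[0.0]]), np.array([[0.5]]), toy_embed)[0, 0]
k_far_out = induced_kernel(np.array([[2.0]]), np.array([[2.5]]), toy_embed)[0, 0]
print(k_near_origin, k_far_out)

# Gram matrix on a grid, as a GP would use it.
grid = np.linspace(-3.0, 3.0, 25)[:, None]
G = induced_kernel(grid, grid, toy_embed)
```

The two kernel values differ even though the input displacement is identical, so the induced kernel depends on location and not only on $x - x'$. The Gram matrix remains positive semidefinite because it is an ordinary RBF Gram matrix on the embedded points.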

3. Key Contributions

SDR Algorithm: A unified framework for supervised distributional reduction that integrates OT-based alignment with explicit dependence maximization (CKA) to learn compact, target-aware representations.
Theoretical Insight: Identification and resolution of the supervision bottleneck in FGW-based methods by introducing a direct representation-level dependence term.
Out-of-Sample Extension: A formulation of the input-to-embedding mapping as a regularized kernel ridge regression problem, enabling SDR to function as a feature extractor in predictive pipelines.
Non-Stationary Kernel Design: A mechanism for constructing adaptive kernels for GPs that respond to local data structure and supervision without requiring joint end-to-end training of deep networks.

4. Experimental Results

4.1 Distributional Reduction Benchmarks

The authors evaluated SDR on three classification datasets (COIL-20, Fashion-MNIST, SNAREseq) against DistR, Cluster-then-DR, and DR-then-Cluster.

Metrics: Homogeneity score, k-means Normalized Mutual Information (NMI), and Silhouette score.
Findings: SDR ran in time comparable to DistR, with only modest computational overhead. Crucially, SDR produced representations with higher label consistency and semantic coherence, indicating that the explicit dependence term captures target-relevant structure better than the unsupervised baselines.

4.2 Kernel Learning Benchmarks (GPs)

SDR was evaluated as a feature extractor for Gaussian Processes on regression (Boston Housing, Energy Efficiency, Concrete) and classification (MNIST, COIL-20) tasks.

Comparisons: SDR-GP was compared against NCA-GP, KSPCA-GP, UMAP-GP, Deep Gaussian Processes (DGP), and Deep Kernel Learning (DKL).
Performance:
- Regression: SDR-GP achieved the best Mean Log Likelihood (MLL) and competitive Mean Squared Error (MSE) across all datasets, often outperforming DKL and DGP.
- Classification: SDR-GP achieved high Mean Log Probability (MLP) and Accuracy (ACC), matching or exceeding DKL performance.
- Uncertainty Calibration: SDR-GP provided reasonably calibrated uncertainties, comparable to or better than other methods, as evidenced by Mean Absolute Calibration Error (MACE) metrics.
Ablation: Experiments confirmed that the CKA term ($\eta$) and the projection regularization ($\beta$) are critical for balancing predictive signal retention and generalization.
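For readers unfamiliar with the regression metrics, the sketch below computes a mean log likelihood and a mean squared error from Gaussian predictive means and variances. It assumes independent Gaussian predictive distributions per test point; the paper's exact metric definitions may differ, and the function names are hypothetical.

```python
import numpy as np

def mean_log_likelihood(y, mu, var):
    """Average log density of y under independent N(mu, var) predictions."""
    return np.mean(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mu) ** 2 / var)

def mean_squared_error(y, mu):
    """Average squared difference between targets and predictive means."""
    return np.mean((y - mu) ** 2)

y = np.array([0.0, 1.0, 2.0, 3.0])
mu = y + 0.5                                  # predictions off by a constant 0.5

# Same means, different stated uncertainty.
mll_calibrated = mean_log_likelihood(y, mu, var=np.full(4, 0.25))
mll_overconfident = mean_log_likelihood(y, mu, var=np.full(4, 0.001))
mse = mean_squared_error(y, mu)
print(mll_calibrated, mll_overconfident, mse)
```

Both sets of predictions share the same MSE, but the overconfident one scores a much lower log likelihood. This is why likelihood-based metrics are reported alongside MSE when uncertainty quality matters.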

5. Significance and Claims

The paper claims that SDR provides a principled, non-parametric approach to learning target-aware representations that preserve intrinsic geometry while explicitly maximizing dependence on task labels. By addressing the supervision bottleneck in OT-based methods, SDR enables the construction of compact representations that are effective for both clustering and downstream prediction.

The authors highlight that SDR offers a distinct advantage over Deep Kernel Learning: it decouples representation learning from the probabilistic model, avoiding the sensitivity to initialization and training difficulties often associated with joint optimization in low-data regimes. Furthermore, the induced non-stationary kernels offer a data-driven perspective on kernel design that adapts to local variations in supervision and structure.

The work suggests that combining transport-based structural alignment with explicit dependence maximization is a viable and effective strategy for supervised dimensionality reduction and distributional summarization, particularly in settings where interpretability and uncertainty quantification are required.

Original paper: Supervised Distributional Reduction via Optimal Transport and Dependence Maximization