Imagine you are a chef trying to create a perfectly balanced soup that represents the taste of an entire city. However, the only ingredients you have access to come from a small, specific neighborhood where everyone loves spicy food. If you just taste that neighborhood's soup and try to "fix" it to represent the whole city, you'll likely fail. You won't know what the bland, sweet, or savory flavors of the rest of the city actually taste like, because you never tasted enough of them.
This is the problem of Representation Bias in Artificial Intelligence. AI models are often trained on data that over-represents certain groups (like white men or people with college degrees) and under-represents others (like women of color or people with less education). When we try to "fix" the AI to be fair, we often fail because we didn't gather enough data on the under-represented groups to understand them properly.
This paper proposes a clever new way to fix this, using a concept called Optimal Transport (think of it as a logistics map for moving data) and a smart "Stop-When-You-Know-Enough" rule.
Here is the breakdown of their solution using simple analogies:
1. The Problem: The "Under-Represented" Guest
Imagine a party where 90% of the guests are wearing red shirts, and only 10% are wearing blue shirts. If you want to plan a menu that everyone likes, but you only ask the red-shirted guests what they want, your menu will be terrible for the blue-shirted guests.
In AI, this is Representation Bias. The "blue shirts" (minority groups) are there, but there are so few of them in the training data that the AI doesn't learn their true patterns. It's like trying to guess the shape of a mountain by looking at only one tiny pebble.
2. The Old Way: The "Fixed Sample" Mistake
Previous methods tried to fix this by taking a fixed number of samples from every group. They might say, "Let's take 1,000 samples from the red group and 1,000 from the blue group."
- The Flaw: If the blue group is naturally rare in the real world, forcing 1,000 samples might mean you are just repeating the same few blue-shirted people over and over again. You aren't learning the true variety of the blue group; you're just re-sampling the same handful of individuals. The AI still doesn't understand the "blue" flavor.
3. The New Solution: The "Smart Tasting" Rule
The authors propose a Bayesian Nonparametric Stopping Rule. Let's translate this into our kitchen analogy:
Instead of deciding in advance how many people to taste, you keep tasting new people from the blue group until you are sure you understand their taste profile.
- The Process: You taste a blue-shirted person. Then another. Then another.
- The Check: After every new person, you ask yourself: "Did this new person teach me something new about what blue-shirted people like, or did they just taste like the last one?"
- The Stop: As soon as the new person tastes very similar to the ones you've already met (meaning you've mapped out the full flavor profile of the blue group), you stop collecting data for that group.
This ensures that even if the blue group is tiny, you gather just enough unique information to understand them fully, without wasting time on duplicates. You don't stop because you hit a number; you stop because you hit knowledge.
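The taste-check-stop loop above can be sketched in code. To be clear, this is not the paper's exact Bayesian nonparametric rule; as an illustrative stand-in, it uses the simple Good-Turing estimate (the fraction of observations seen exactly once so far) as the chance that the next sample reveals something genuinely new, and stops once that chance is small. All function names and thresholds here are assumptions for illustration.

```python
import random

def sample_until_informed(draw, novelty_threshold=0.05,
                          min_samples=20, max_samples=10_000):
    """Keep drawing samples from a group until the estimated chance that
    the next draw shows a genuinely new pattern drops below a threshold.

    `draw` is a zero-argument function returning one hashable observation.
    The novelty estimate is the Good-Turing rule: the fraction of
    observations seen exactly once so far (a stand-in for the paper's
    Bayesian nonparametric stopping criterion).
    """
    counts = {}
    n = 0
    while n < max_samples:
        x = draw()
        counts[x] = counts.get(x, 0) + 1
        n += 1
        if n >= min_samples:
            singletons = sum(1 for c in counts.values() if c == 1)
            novelty = singletons / n  # est. P(next sample is something new)
            if novelty < novelty_threshold:
                break  # new arrivals taste like people we've already met
    return counts, n

# A toy "blue group" with only five underlying taste profiles:
random.seed(0)
profiles = ["mild", "sweet", "sour", "umami", "bitter"]
counts, n_used = sample_until_informed(lambda: random.choice(profiles))
```

Note that the loop stops because the novelty estimate collapses, not because a fixed sample count was reached: with only five underlying profiles, repeated draws quickly stop producing singletons.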
4. The Repair: The "Fairness Transport Map"
Once you have fully understood the flavors of both the red and blue groups, you use Optimal Transport.
Think of this as a logistics company. You have a pile of "Red" ingredients and a pile of "Blue" ingredients. You want to create a "Fair" soup where the ingredients are mixed perfectly so that no one group is favored.
- The "Optimal Transport" algorithm draws a map. It says, "Take this specific spicy ingredient from the Red group and move it here to balance with this mild ingredient from the Blue group."
- It moves the data points (the ingredients) to a middle ground (the "Fair Target") so that the final result doesn't depend on whether you are Red or Blue.
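Here is a minimal one-dimensional sketch of that transport map. In 1-D, the optimal transport plan between two equal-size samples simply pairs sorted values (quantile matching), and moving each pair toward its average produces a shared "fair target" distribution. The function name, the equal-size assumption, and the simple midpoint target are simplifications for illustration, not the paper's exact construction.

```python
def quantile_repair(red, blue, weight=0.5):
    """Move each group's values toward a common 'fair target' distribution.

    In 1-D the optimal transport map matches sorted values (quantiles).
    Each value is moved `weight` of the way toward the matching quantile
    of the other group, so with weight=0.5 both groups end up sharing the
    same in-between distribution.
    """
    red_sorted = sorted(red)
    blue_sorted = sorted(blue)
    assert len(red_sorted) == len(blue_sorted), "sketch assumes equal group sizes"
    # Pair the i-th smallest red value with the i-th smallest blue value,
    # and move both toward their weighted average (the fair target).
    repaired_red = [(1 - weight) * r + weight * b
                    for r, b in zip(red_sorted, blue_sorted)]
    repaired_blue = [weight * r + (1 - weight) * b
                     for r, b in zip(red_sorted, blue_sorted)]
    return repaired_red, repaired_blue

red = [10, 20, 30, 40]    # e.g. scores in the majority group
blue = [30, 40, 50, 60]   # systematically shifted minority group
fair_red, fair_blue = quantile_repair(red, blue)
# With weight=0.5 both groups land on the same distribution: [20, 30, 40, 50]
```

With `weight=0.5` both groups are mapped to the halfway distribution, so the repaired value no longer reveals whether a point was Red or Blue; a smaller weight gives a partial repair that moves the data less.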
5. Why This is Better
- No More Guessing: Because the "Smart Tasting" rule ensures you fully understand the minority groups before you start, the repair works even for groups that are very rare.
- Generalization: The old methods could only fix the specific data they had. This new method learns the rules of the minority groups, so it can repair data it has never seen before (like archival data or future data streams).
- Less Damage: Sometimes, fixing AI makes the data so weird that it loses its usefulness (like turning a delicious soup into water just to make it "fair"). This method measures how much "damage" it does to the data and tries to keep it to a minimum while still being fair.
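One simple way to put a number on that "damage" is the average distance each data point is moved by the repair; for sorted 1-D samples this equals the empirical Wasserstein-1 distance between the original and repaired distributions. The helper below is an illustrative sketch, not the paper's actual metric.

```python
def average_displacement(before, after):
    """Mean distance the repair moved each value: a simple proxy for how
    much 'damage' the fairness fix does to the data's usefulness.
    For sorted 1-D samples this is the empirical Wasserstein-1 distance."""
    assert len(before) == len(after)
    return sum(abs(a - b)
               for a, b in zip(sorted(before), sorted(after))) / len(before)

original = [10, 20, 30, 40]        # a group's scores before repair
fully_repaired = [20, 30, 40, 50]  # after moving to a fair target
damage = average_displacement(original, fully_repaired)  # → 10.0
```

A fairness method can then trade off explicitly: the closer this number is to zero, the more of the soup's original flavor survives the repair.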
The Bottom Line
This paper is about teaching AI to be fair by not rushing. Instead of forcing a fixed number of samples, it says: "Keep learning until you truly understand the under-represented groups, and then fix the data."
It's like saying, "Don't just ask 10 people what they think; keep asking until you are confident you know what the whole neighborhood thinks, even if that neighborhood is small." This ensures that when the AI makes decisions, it treats everyone fairly, not just the loud majority.