No evaluation without fair representation: Impact of label and selection bias on the evaluation, performance and mitigation of classification models

Using a new framework for injecting controlled bias into data, this paper empirically analyzes the distinct impacts of label bias and selection bias on the evaluation and performance of classification models. It finds that fairness-accuracy trade-offs disappear when models are evaluated on unbiased data, and that the effectiveness of mitigation methods depends on which type of bias is present.

Magali Legast, Toon Calders, François Fouss

Published Wed, 11 Ma

What follows is a plain-language walkthrough of the paper, built on a few analogies.

The Big Idea: You Can't Judge a Book by Its Biased Cover

Imagine you are a teacher trying to grade a class of students. However, there's a problem: the test papers you have been given to grade were tampered with before the students even saw them.

  • Scenario A (Label Bias): Someone took a red pen and crossed out the "A" grades of the girls, changing them to "F"s, even though the girls actually did great work.
  • Scenario B (Selection Bias): The teacher only collected test papers from the back row of the classroom, ignoring the front row entirely. Or worse, they only collected papers from students who volunteered to take the test, and the shy students (who might have studied harder) stayed home.

This paper argues that if you use these tampered papers to judge the students' abilities, you will get the wrong answer. Furthermore, if you try to "fix" the grading system using these bad papers, your fixes might make things worse.
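The two tampering scenarios above can be made concrete with a couple of toy functions. This is an illustrative sketch, not the paper's actual framework: the function names, the 0/1 label encoding, and the convention that `sensitive == 1` marks the disadvantaged group are all assumptions made for the example.

```python
import numpy as np

def inject_label_bias(y, sensitive, flip_rate, rng):
    """Scenario A: flip some positive labels of the disadvantaged
    group (sensitive == 1) to negative -- the 'red pen' tampering."""
    y = y.copy()
    victims = (sensitive == 1) & (y == 1) & (rng.random(len(y)) < flip_rate)
    y[victims] = 0
    return y

def inject_selection_bias(X, y, sensitive, keep_rate, rng):
    """Scenario B: drop positive examples of the disadvantaged group
    from the sample -- the 'papers never collected' tampering."""
    drop = (sensitive == 1) & (y == 1) & (rng.random(len(y)) >= keep_rate)
    keep = ~drop
    return X[keep], y[keep], sensitive[keep]
```

Note the difference in kind: label bias corrupts the answers while keeping every row, whereas selection bias keeps the answers honest but makes some rows vanish entirely.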

The authors, Magali Legast, Toon Calders, and François Fouss, built a special laboratory to test this. They wanted to answer three questions:

  1. How does tampering with data mess up our evaluation of AI models?
  2. Does fixing the bias actually hurt the model's accuracy (the old "fairness vs. accuracy" debate)?
  3. Which "fixes" work best for which specific type of tampering?

The Experiment: Building a "Time Machine" for Data

Usually, researchers take a real-world dataset (like a list of loan applications) and say, "This data is biased, let's try to fix it." The problem is, they don't know exactly how the bias got there, so they can't be sure if their fix actually worked.

The Authors' Solution:
They started with a dataset they believed was already fair (like a clean, honest record of student performance). Then, they acted like "data saboteurs." They artificially injected specific types of bias into the data to create a "distorted world."

  • The Setup: They had a Fair World (the original clean data) and a Biased World (the tampered data).
  • The Test: They trained AI models on the Biased World but tested them on the Fair World.
  • The Result: This allowed them to see the true performance of the AI, rather than just how well it fit the broken data.
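The Fair World / Biased World setup can be sketched end to end on synthetic data. This is an illustrative reconstruction, not the paper's actual experiment: the data generator, the 40% flip rate, and the use of logistic regression are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# "Fair world": the outcome depends only on a skill score, not on group.
n = 4000
skill = rng.normal(size=n)
group = rng.integers(0, 2, size=n)  # sensitive attribute
y_fair = (skill + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.column_stack([skill, group])

# "Biased world": flip 40% of group-1 positive labels (label bias).
y_biased = y_fair.copy()
flip = (group == 1) & (y_fair == 1) & (rng.random(n) < 0.4)
y_biased[flip] = 0

idx_train, idx_test = train_test_split(np.arange(n), random_state=0)

# Train on the Biased World...
model = LogisticRegression().fit(X[idx_train], y_biased[idx_train])

# ...then score the SAME model against both worlds' test labels.
acc_on_biased = model.score(X[idx_test], y_biased[idx_test])
acc_on_fair = model.score(X[idx_test], y_fair[idx_test])
```

The gap between the two accuracy numbers is the whole point: scoring against the biased labels tells you how well the model fits the broken data, while scoring against the fair labels reveals its true performance.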

Key Findings (The "Aha!" Moments)

1. The "Fairness vs. Accuracy" Myth is Dead

For years, experts believed you had to choose: either your AI is accurate (predicts well) or it is fair (treats everyone equally), but you can't have both.

The Paper's Verdict: This trade-off is an illusion caused by bad testing.

  • Analogy: Imagine timing runners on a track with a giant hill on one side. A runner who learned to lean into the hill looks fast on that track, while a runner with proper form looks slow.
  • The Reality: When you time everyone on a flat, fair track (the unbiased test set), you discover that the runner with proper form (the model that ignored the bias) is actually both the fastest and the fairest.
  • Conclusion: When you evaluate models on fair data, you often find that you can improve fairness without losing accuracy. In fact, the most accurate models are often the fairest ones.

2. Not All "Fixes" Work for All "Breaks"

The paper tested eight different methods to fix bias (like "Reweighing," "Massaging," and "FTU," short for Fairness Through Unawareness). They found that one size does not fit all.

  • Analogy: Think of bias like different types of injuries.
    • Label Bias is like a broken leg. You need a cast (a specific fix).
    • Selection Bias is like a sprained ankle. You need a brace (a different fix).
    • If you try to put a cast on a sprained ankle, it won't help, and might even make it worse.

Specific Findings:

  • Reweighing (giving more importance to underrepresented groups) worked great for Selection Bias (missing data), but was okay for Label Bias.
  • Massaging (changing labels of people near the decision line) worked well for Label Bias (wrong labels), but made Selection Bias much worse.
  • FTU (ignoring sensitive attributes like gender/race) worked surprisingly well across the board, but only if the other data features were strong enough to predict the outcome without needing to "peek" at the sensitive info.
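As a concrete illustration of one of these methods, here is a sketch of Reweighing in the style of Kamiran and Calders: each (group, label) combination gets the weight it would have if group and label were statistically independent, so underrepresented combinations count for more during training. The function name and the 0/1 encodings are illustrative choices.

```python
import numpy as np

def reweighing_weights(sensitive, y):
    """Weight each (group, label) combination by
    P(group) * P(label) / P(group, label), so that in the weighted
    data the label is statistically independent of the group."""
    w = np.empty(len(y), dtype=float)
    for s in np.unique(sensitive):
        for c in np.unique(y):
            mask = (sensitive == s) & (y == c)
            expected = (sensitive == s).mean() * (y == c).mean()
            observed = mask.mean()
            w[mask] = expected / observed if observed > 0 else 0.0
    return w
```

A combination that is rarer than independence would predict (say, positive labels in the disadvantaged group) gets a weight above 1, which is why this repairs the distorted proportions that selection bias creates.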

3. The Danger of "Reverse Discrimination"

Some methods tried to fix bias by aggressively flipping the results to make the numbers look equal.

  • Analogy: Imagine a scale that is tipped to the left. To fix it, you don't just add weight to the right; you start throwing heavy rocks off the left side. Eventually, the scale tips too far to the right, and now the other group is being treated unfairly.
  • The Paper's Warning: Some methods, when applied to the wrong type of bias, actually created "reverse unfairness," hurting the privileged group to over-compensate for the bias.
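One simple way to catch this over-correction is to keep the fairness metric signed rather than taking its absolute value. A minimal sketch, assuming group 0 is the privileged group and a prediction of 1 is the favourable outcome (both encodings are assumptions for the example):

```python
import numpy as np

def statistical_parity_difference(pred, sensitive):
    """P(pred=1 | privileged) - P(pred=1 | disadvantaged).
    Positive: the privileged group is favoured.
    Negative after mitigation: the scale has tipped the other way."""
    rate_privileged = pred[sensitive == 0].mean()
    rate_disadvantaged = pred[sensitive == 1].mean()
    return rate_privileged - rate_disadvantaged
```

Reporting only the absolute value would make "tipped too far right" look identical to "tipped too far left," which is exactly how reverse unfairness goes unnoticed.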

Why This Matters for You

This paper is a wake-up call for anyone building or using AI.

  1. Don't trust the test set: If you train an AI on biased data and test it on the same biased data, you are just confirming the bias. You need a "gold standard" or a "fair test" to see if the AI is actually doing a good job.
  2. Know your enemy: Before you try to fix an AI, you need to know how it got broken. Was some of the data missing (selection bias)? Were the labels wrong (label bias)? Different problems need different solutions.
  3. Fairness and Accuracy are friends: Stop thinking you have to sacrifice accuracy to be fair. If you measure things correctly, you'll find that the best models are usually both accurate and fair.

The Bottom Line

The authors built a "bias simulator" to prove that our current way of testing AI is flawed. They showed that if we stop using broken rulers to measure our tools, we can build AI systems that are not only smarter but also fairer, without having to make a painful trade-off between the two.