Imagine you are hiring a new employee, but instead of looking at their resume or interviewing them directly, you ask a very smart, but slightly biased, assistant to describe them first. If that assistant's description quietly reveals things it shouldn't, your "fair" decision inherits the bias.
This paper is about fixing a specific type of AI system called a Concept Bottleneck Model (CBM). Think of a CBM as a "middleman" AI. Instead of looking at a photo and guessing what's happening, it first translates the photo into a list of human-readable ideas (concepts) like "wearing a tie," "holding a spatula," or "standing in a kitchen." Then, it uses that list to make a final decision.
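The two-stage pipeline can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function names, concept names, and the hard-coded scores and weights are all made up for clarity.

```python
# Minimal sketch of a Concept Bottleneck Model's two-stage pipeline.
# All names and numbers here are illustrative, not from the paper.

def predict_concepts(image_features):
    """Stage 1: map raw image features to human-readable concept scores (0-1)."""
    # In a real CBM this is a trained neural network; here, a stand-in lookup.
    return {"holding_spatula": 0.9, "wearing_tie": 0.1, "in_kitchen": 0.8}

def predict_label(concept_scores):
    """Stage 2: the final classifier sees ONLY the concept scores."""
    # A toy linear rule: chef-like concepts push the score toward "chef".
    score = 0.7 * concept_scores["holding_spatula"] + 0.5 * concept_scores["in_kitchen"]
    return "chef" if score > 0.5 else "other"

label = predict_label(predict_concepts(None))
print(label)  # -> chef
```

The key design point is the bottleneck itself: `predict_label` never touches the image, only the concept list, which is what makes the decision inspectable.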
The goal is to make AI fairer and easier to understand. But the authors found a problem: The middleman is leaking secrets.
Here is the breakdown of the paper using simple analogies:
1. The Problem: The "Over-Talkative" Translator
The researchers wanted to use CBMs to stop AI from being biased (e.g., assuming only men are chefs or only women are nurses). The idea was: "If the AI only looks at 'holding a spatula' and ignores 'has a beard,' it will be fair!"
However, they discovered that even when the AI bases its decision on "holding a spatula," the numeric score it computes for that concept secretly carries information about gender.
- The Analogy: Imagine a translator who is supposed to translate a sentence from French to English. But, the translator accidentally whispers the speaker's accent and gender into the English translation. Even if the English words are correct, the "whisper" reveals the speaker's identity, allowing the listener to make biased assumptions.
- The Result: The AI was still "hearing" the gender, even though it was supposed to be looking only at the actions. This is called Information Leakage.
2. The Solution: Three Ways to Muffle the Leaks
The team tried three different ways to stop the AI from leaking this secret gender information.
Technique A: The "Top-K" Filter (The Spotlight)
Instead of letting the AI look at every single tiny detail it found (which includes the noisy, biased whispers), they told it to only focus on the top 20 most important concepts.
- The Analogy: Imagine you are trying to identify a song. You could listen to the entire 3-hour concert recording (which includes the crowd noise, the band tuning up, and the singer's coughing). Or, you could put on noise-canceling headphones and only listen to the top 20 loudest notes.
- The Outcome: By forcing the AI to ignore the "background noise" (the subtle gender clues hidden in weak concepts), it became much fairer. It didn't lose much accuracy, but it stopped making gender-based guesses.
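The Top-K idea above can be sketched as a simple masking step: keep only the K strongest concept scores and zero out the rest before the final classifier sees them. This is a hedged illustration with made-up names and values, not the paper's code.

```python
# Sketch of Top-K concept filtering: zero out all but the k
# largest-magnitude concept scores. Names and values are illustrative.

def top_k_filter(concept_scores, k):
    """Keep the k strongest concepts; set every other score to 0.0."""
    ranked = sorted(concept_scores, key=lambda name: abs(concept_scores[name]), reverse=True)
    keep = set(ranked[:k])
    return {name: (score if name in keep else 0.0)
            for name, score in concept_scores.items()}

scores = {"holding_spatula": 0.92, "in_kitchen": 0.81,
          "necktie": 0.07, "short_hair": 0.04}
filtered = top_k_filter(scores, k=2)
print(filtered)  # the weak, potentially leaky concepts are zeroed out
```

Note that the weakly activated concepts ("necktie", "short_hair"), where subtle bias signals tend to hide, are exactly the ones the filter silences.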
Technique B: Removing the "Bad Apples" (The Edit)
They tried to find concepts that were obviously biased (like "necktie" for men or "blouse" for women) and delete them from the AI's vocabulary.
- The Analogy: It's like trying to fix a biased jury by kicking out the jurors who wear specific hats.
- The Outcome: This didn't work well. Why? Because the AI is sneaky. If you remove "necktie," it just starts using "short hair" or "deep voice" as a new way to guess the gender. The "leak" just moved to a different pipe.
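Concept deletion amounts to dropping named entries from the vocabulary entirely. The sketch below (illustrative names only) also shows why it fails: a correlated concept survives the cut and can keep carrying the same signal.

```python
# Sketch of concept deletion: remove banned concepts from the vocabulary.
# As the text explains, bias often resurfaces in correlated concepts
# that remain. All names are illustrative.

def remove_concepts(concept_scores, banned):
    """Drop every concept whose name is in the banned set."""
    return {name: score for name, score in concept_scores.items()
            if name not in banned}

scores = {"holding_spatula": 0.9, "necktie": 0.6, "short_hair": 0.5}
filtered = remove_concepts(scores, banned={"necktie"})
print(filtered)  # "necktie" is gone, but correlated "short_hair" remains
```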
Technique C: The "Adversarial" Game (The Coach)
They added a second AI (a "coach") whose only job is to try to guess the gender based on the first AI's answers. The main AI then tries to get better at its job without letting the coach guess the gender.
- The Analogy: Imagine a student taking a test. A proctor stands next to them and tries to guess the student's gender based on how they write. The student realizes, "Oh, if I write too neatly, the proctor knows I'm a girl." So, the student learns to write in a way that gives no clues about their gender, while still getting the right answers.
- The Outcome: This was the most effective method. It forced the AI to learn the task (like "frying an egg") without relying on gender clues at all.
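The adversarial setup is commonly trained with a combined objective: the main model minimizes its task loss minus the adversary's gender-prediction loss, so confusing the adversary directly improves the main model's score. The sketch below shows that objective under common adversarial-debiasing assumptions; it is not necessarily the paper's exact formulation, and the numbers are made up.

```python
# Sketch of the adversarial objective: do the task well, but make the
# adversary's gender guesses fail. A common formulation (assumed here,
# not taken from the paper) subtracts the adversary's loss, scaled by lam.

def combined_loss(task_loss, adversary_loss, lam=1.0):
    """Main model's objective: low task loss AND a confused adversary."""
    return task_loss - lam * adversary_loss

# A representation that leaks gender lets the adversary score well
# (low adversary_loss), so the main model is penalized; a representation
# that hides gender (high adversary_loss) is rewarded.
leaky = combined_loss(task_loss=0.30, adversary_loss=0.05)
private = combined_loss(task_loss=0.32, adversary_loss=0.60)
print(leaky > private)  # the gender-hiding representation scores better (lower)
```

In practice both networks are trained together, often via a gradient-reversal trick, so the main model slowly learns concept scores the adversary cannot exploit.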
3. The Big Trade-off: The "Goldilocks" Zone
The paper found a tricky balance, like finding the perfect temperature for a shower:
- Too many concepts: The AI is very accurate but leaks too much bias (too hot).
- Too few concepts: The AI is very fair but makes too many mistakes (too cold).
- Just right: Using the Top-K Filter combined with the Adversarial Coach, they found a sweet spot. The AI became 28% fairer with almost no loss in accuracy.
Why This Matters
Most AI models are "Black Boxes"—you put a picture in, and a guess comes out, but you don't know why.
- Old Way: The AI guesses "Chef" because it saw a man. You can't see the bias.
- New Way (CBM): The AI says, "I guessed Chef because I saw a spatula and a stove."
- The Fix: The authors showed that even with this transparent system, the AI was still sneaking in bias. But by using their new filters and coaching methods, they made the system both transparent AND fair.
In a nutshell: They built a smarter, more honest AI that explains its reasoning, and then taught it how to stop listening to the "whispers" of bias that were hiding in its own explanations.