Interpretable Debiasing of Vision-Language Models for Social Fairness

Imagine you have a very smart, super-fast librarian called a Vision-Language Model (VLM). This librarian has read billions of books and looked at billions of photos. Because they learned from the real world, they also learned the world's stereotypes.

If you ask this librarian, "Show me a picture of a CEO," they might only show you pictures of men in suits, even though women are CEOs too. If you ask, "Is this person a nurse?" they might say "No" if the person looks like a man, because they've learned that nurses are usually women.

The problem is that this librarian is a "Black Box." We can see what they answer, but we don't know why they are giving those biased answers. It's like trying to fix a broken clock without being allowed to open the back to see the gears.

The Problem with Current Fixes

Most people try to fix this librarian by:

Rewriting their memory: Forcing them to re-learn everything from scratch (very expensive and slow).
Putting a filter on their eyes: Telling them, "Don't look at gender," but this often makes them forget how to do their job properly (like forgetting how to tell a cat from a dog).

These methods are like trying to fix a leaky pipe by painting over the wall. The water (bias) is still leaking inside, and the wall might start crumbling (the model gets worse at its job).

The New Solution: DEBIASLENS

The authors of this paper built a tool called DEBIASLENS. Think of it as a pair of high-tech X-ray glasses that lets us see the tiny gears inside the librarian's brain without taking the clock apart.

Here is how it works, using a simple analogy:

1. The "Neuron" Garden

Imagine the librarian's brain is a giant garden with millions of tiny plants (called neurons).

Some plants help the librarian recognize a "cat."
Some plants help them recognize "sadness."
Unfortunately, some plants have grown wild and are only triggered by "men" or "women" in specific jobs. These are the Bias Plants.

2. The "Sparse Autoencoder" (SAE) - The Gardener's Lens

The researchers used a special tool called a Sparse Autoencoder (SAE). Think of this as a super-smart gardener who can look at the garden and say:

"Ah, see that specific plant over there? It only lights up when we talk about 'female nurses.' That's a Bias Plant!"

Usually, these plants are tangled up with other plants (for example, the "nurse" and "female" concepts grow intertwined). The SAE untangles them, separating the "nurse" concept from the "female" concept. It finds the specific plant responsible for the bias.

3. The "Debiasing" - Turning Down the Volume

Once the gardener finds the Bias Plants, they don't rip them out (which might damage the garden). Instead, they just turn the volume down on those specific plants.

Before: When you ask about a CEO, the "Male CEO" plant screams at 100% volume.
After DEBIASLENS: The "Male CEO" plant is muted to 10% volume. The librarian still knows what a CEO is, but they don't automatically assume it's a man.

Why is this special?

It's Transparent: We know exactly which plants we turned down. We aren't guessing.
It's Precise: We only touch the bias plants. The plants that help the librarian recognize cats, dogs, and math problems stay loud and clear.
It Works Everywhere: It works on both the "eyes" (image recognition) and the "voice" (text understanding) of the librarian.

The Results

The researchers tested this on two famous librarians (CLIP and InternVL).

Before: When asked to find a "CEO," the model showed men 90% of the time.
After: With DEBIASLENS, the model showed men and women much more equally (closer to 50/50).
Best of all: The librarian didn't get "dumber." They were still just as good at answering questions and recognizing objects; they just stopped making unfair assumptions.

In a Nutshell

DEBIASLENS is like a surgeon's scalpel for AI. Instead of smashing the whole machine to fix a small problem, it gently identifies the tiny, biased gears inside, turns them down, and lets the machine run smoothly and fairly. It makes AI more trustworthy by making its "thought process" visible and fixable.

1. Problem Statement

Vision-Language Models (VLMs) and Large VLMs (LVLMs) have achieved remarkable capabilities but inherit and often amplify societal biases present in their training data. These biases manifest in two primary ways:

Text-to-Image (T2I) Retrieval: Models like CLIP retrieve skewed demographic distributions (e.g., predominantly male images) for neutral prompts like "A photo of a CEO."
Visual Question Answering (VQA): Models like InternVL provide definitive, biased answers to ambiguous questions (e.g., assuming a specific gender for a profession) rather than acknowledging uncertainty.

Limitations of Existing Methods: Current debiasing approaches rely on post-hoc learning (fine-tuning, prompt tuning) or test-time algorithms (pruning, prompt engineering). These methods suffer from critical limitations:

Lack of Interpretability: They treat the model as a black box, failing to identify where or how bias is encoded internally.
Performance Trade-offs: Techniques like weight pruning often reduce bias at the cost of significantly degrading the model's general reasoning capabilities.
Surface-Level Fixes: They mitigate symptoms without altering the underlying internal representations responsible for bias propagation.

2. Methodology: DEBIASLENS

The authors propose DEBIASLENS, a model-agnostic, interpretable framework that localizes and modulates "social neurons" within VLMs without retraining the base model weights. The method consists of three stages:

A. Sparse Autoencoder (SAE) Training

Architecture: An SAE is attached to the last layer of the VLM's image or text encoder.
Objective: The SAE decomposes the entangled feature space of the frozen VLM encoder into a sparse, high-dimensional latent space. It is trained to reconstruct the original features while enforcing sparsity.
Data: The SAE is trained on facial image or caption datasets (e.g., FairFace, Cocogender) without using explicit social attribute labels during the training phase. The goal is to let the SAE naturally discover latent features corresponding to demographics.
Loss Function: Uses a multi-scale reconstruction loss (Matryoshka SAE) and an auxiliary loss to ensure accurate reconstruction at different levels of sparsity.
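
To make the training objective more concrete, here is a minimal sketch of an SAE forward pass with a Matryoshka-style multi-scale reconstruction loss, written in plain Python on toy dimensions. It is not the authors' implementation: the ReLU encoder, the L1 penalty standing in for the sparsity mechanism, the nested prefix sizes, and every name (`encode`, `decode`, `matryoshka_loss`, `prefix_sizes`, `l1_coeff`) are assumptions made for illustration, and the auxiliary loss is omitted.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def encode(v, W_enc, b_enc):
    """Map a frozen encoder feature v (length d) to non-negative latents (length m)."""
    return [relu(sum(w * x for w, x in zip(row, v)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(z, W_dec, b_dec, prefix=None):
    """Reconstruct the feature from the first `prefix` latents (all latents if None)."""
    m = len(z) if prefix is None else prefix
    d = len(b_dec)
    return [b_dec[j] + sum(z[i] * W_dec[i][j] for i in range(m)) for j in range(d)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def matryoshka_loss(v, W_enc, b_enc, W_dec, b_dec, prefix_sizes, l1_coeff=1e-3):
    """Reconstruction error summed over nested latent prefixes, plus an L1 sparsity term."""
    z = encode(v, W_enc, b_enc)
    recon = sum(mse(decode(z, W_dec, b_dec, p), v) for p in prefix_sizes)
    sparsity = l1_coeff * sum(z)  # latents are non-negative, so the sum is the L1 norm
    return recon + sparsity

# Toy setup (hypothetical sizes): a 4-d feature expanded into 8 latents.
random.seed(0)
d, m = 4, 8
W_enc = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(m)]
W_dec = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(m)]
b_enc, b_dec = [0.0] * m, [0.0] * d
v = [0.2, -0.1, 0.4, 0.3]  # stands in for a frozen VLM encoder feature

z = encode(v, W_enc, b_enc)
loss = matryoshka_loss(v, W_enc, b_enc, W_dec, b_dec, prefix_sizes=[2, 4, 8])
```

In a real setup the weights would be trained by gradient descent on this loss over many encoder features, while the VLM encoder itself stays frozen.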

B. Social Neuron Probing

Hypothesis: Social bias arises from consistent correlations with specific demographics. Therefore, specific neurons in the SAE latent space will activate consistently for a specific group (e.g., "female") but not others.
Selection Process:
1. Effectiveness: Identify neurons whose activation is non-zero for at least a proportion $\tau$ of the samples in a given group $g$.
2. Specificity: Take the set difference between the effective neurons of group $g$ and the union of the effective neurons of all other groups. This isolates neurons unique to group $g$.
3. Ranking: From the remaining neurons, select those with the highest mean activation within the target group.
Outcome: A set of "social neurons" ($Z_B$) is identified for attributes like gender, age, and race.
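
The three selection steps above can be sketched in a few lines of plain Python. This is an illustrative reading of the procedure rather than the authors' code: the function names, the `top_k` parameter, the default `tau`, and the toy activations are all assumptions.

```python
def effective_neurons(activations, tau):
    """Indices of neurons that are non-zero on at least a fraction tau of the samples."""
    n = len(activations)
    m = len(activations[0])
    return {
        j for j in range(m)
        if sum(1 for z in activations if z[j] != 0.0) / n >= tau
    }

def social_neurons(acts_by_group, target, tau=0.9, top_k=1):
    """Neurons effective for `target` only, ranked by mean activation in that group."""
    # Step 1 (effectiveness): per-group sets of consistently active neurons.
    effective = {g: effective_neurons(a, tau) for g, a in acts_by_group.items()}
    # Step 2 (specificity): remove neurons that are effective for any other group.
    others = set().union(*(s for g, s in effective.items() if g != target))
    specific = effective[target] - others
    # Step 3 (ranking): keep the top_k neurons by mean activation in the target group.
    acts = acts_by_group[target]
    mean_act = {j: sum(z[j] for z in acts) / len(acts) for j in specific}
    return sorted(specific, key=lambda j: mean_act[j], reverse=True)[:top_k]

# Hypothetical SAE latents (4 neurons) for three samples per group.
acts_by_group = {
    "female": [[0.9, 0.5, 0.0, 0.3], [0.8, 0.4, 0.0, 0.2], [0.7, 0.6, 0.1, 0.4]],
    "male":   [[0.0, 0.5, 0.9, 0.3], [0.0, 0.3, 0.8, 0.0], [0.1, 0.6, 0.7, 0.2]],
}

female_neurons = social_neurons(acts_by_group, "female")  # -> [0]
male_neurons = social_neurons(acts_by_group, "male")      # -> [2]
```

Neuron 1 fires consistently for both groups, so the specificity step discards it; this is the mechanism that keeps shared, non-demographic features out of the selected set.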

C. Social Neuron-Modulated Inference

Deactivation: During inference, the activations of the identified social neurons in the latent vector are set to zero (or scaled by a factor $\gamma$).
Reconstruction: The modified latent vector is passed through the SAE decoder to produce a debiased reconstructed feature ($\hat{v}$).
Feature Mixing: To preserve general semantic knowledge while removing bias, the final feature $v'$ is a weighted sum of the original feature $v$ and the reconstructed feature $\hat{v}$:
$v' = \alpha \hat{v} + (1 - \alpha)v$
where $\alpha$ controls the trade-off between debiasing strength and general performance.
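
As a rough illustration of the inference-time steps, the sketch below scales the selected latents by $\gamma$, decodes them, and applies the mixing formula $v' = \alpha \hat{v} + (1 - \alpha)v$. The linear decoder, the toy weights, and the names `modulate`, `decode`, and `debiased_feature` are assumptions; the default `alpha=0.6` simply mirrors the value reported in the ablation.

```python
def modulate(z, social_idx, gamma=0.0):
    """Scale the selected social neurons by gamma (gamma=0 deactivates them)."""
    return [x * gamma if i in social_idx else x for i, x in enumerate(z)]

def decode(z, W_dec, b_dec):
    """Linear SAE decoder: reconstruct a feature from the latent vector."""
    d = len(b_dec)
    return [b_dec[j] + sum(z[i] * W_dec[i][j] for i in range(len(z))) for j in range(d)]

def debiased_feature(v, z, W_dec, b_dec, social_idx, alpha=0.6, gamma=0.0):
    """Return v' = alpha * v_hat + (1 - alpha) * v, with v_hat decoded from modulated latents."""
    v_hat = decode(modulate(z, social_idx, gamma), W_dec, b_dec)
    return [alpha * h + (1.0 - alpha) * x for h, x in zip(v_hat, v)]

# Toy example (hypothetical values): 3 latents, 2-d feature, neuron 2 is a "social neuron".
W_dec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_dec = [0.0, 0.0]
z = [1.0, 0.5, 2.0]
v = decode(z, W_dec, b_dec)  # [3.0, 2.5]; the toy SAE reconstructs v exactly

v_mixed = debiased_feature(v, z, W_dec, b_dec, social_idx={2}, alpha=0.5)  # [2.0, 1.5]
```

Setting `alpha=0.0` returns the original feature unchanged, and `alpha=1.0` returns the fully reconstructed feature with the social neuron's contribution removed, so $\alpha$ interpolates between the two extremes.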

3. Key Contributions

First Interpretable Framework: Introduces what the authors describe as the first debiasing framework applicable to both VLMs and LVLMs that operates at the neuron level, providing transparency into which internal components encode bias.
Model-Agnostic & Label-Free (Training): The method does not require retraining the massive VLM or using demographic labels during the SAE training phase (labels are only used for probing/selection).
Preservation of General Performance: By selectively deactivating only specific neurons rather than pruning weights or fine-tuning, the method mitigates bias while maintaining high performance on general reasoning tasks.
Disentanglement of Social Attributes: Demonstrates that SAEs can successfully disentangle social attributes (gender, age, race) into distinct, monosemantic neurons, allowing for targeted mitigation of specific biases.

4. Experimental Results

The authors evaluated DEBIASLENS on CLIP (ViT-B/16, ViT-L/14) as the VLM and on InternVL2 and LLaVA-1.5 as the LVLMs.

Bias Mitigation (VLMs):
- Achieved a 9–16% reduction in Max Skew for CLIP image retrieval across various prompts (adjectives, occupations, activities).
- Outperformed or matched state-of-the-art methods (e.g., SANER, Prompt, Projection) while offering interpretability.
Bias Mitigation (LVLMs):
- Reduced gender disproportion rates in VQA tasks by 40–50% on the InternVL2 model.
- Improved the model's ability to answer "unsure" or "cannot be determined" for ambiguous questions, rather than making biased definitive guesses.
General Performance:
- Maintained high scores on general benchmarks (ImageNette, MME, MMMU, Seed-Bench).
- Achieved the best trade-off score compared to pruning and full fine-tuning methods.
Interpretability Validation:
- Neuron Specificity: Deactivating "gender neurons" significantly reduced gender bias but had minimal impact on age or race bias (and vice versa), confirming the disentanglement.
- Visual Evidence: Top-activating images for selected neurons clearly corresponded to specific social concepts (e.g., specific hairstyles for gender, specific age groups), validating human interpretability.
Ablation Studies:
- Found that FairFace was the most effective of the tested datasets for training SAEs that yield robust social neurons.
- Identified a mixing weight ($\alpha \approx 0.6$) that best balances bias reduction with general capability.
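
The retrieval results above are reported in terms of Max Skew, which this summary does not define. The sketch below assumes the commonly used definition: for each demographic group, take the log ratio of its observed proportion in the top-k retrieved images to a desired proportion (uniform here), then report the maximum over groups. The function name, the uniform target, and the toy labels are assumptions.

```python
import math
from collections import Counter

def max_skew(retrieved_attrs, groups):
    """Max over groups of ln(observed proportion / desired proportion) in the top-k results."""
    counts = Counter(retrieved_attrs)
    k = len(retrieved_attrs)
    desired = 1.0 / len(groups)  # assumption: a uniform target distribution
    skews = []
    for g in groups:
        p = counts[g] / k
        # A group that never appears has skew -inf and cannot be the maximum
        # as long as some other group is present.
        skews.append(math.log(p / desired) if p > 0 else float("-inf"))
    return max(skews)

groups = ["male", "female"]
skewed = ["male"] * 9 + ["female"] * 1    # hypothetical top-10 labels for one prompt
balanced = ["male"] * 5 + ["female"] * 5

skew_before = max_skew(skewed, groups)    # ln(0.9 / 0.5) = ln(1.8)
skew_after = max_skew(balanced, groups)   # ln(0.5 / 0.5) = 0.0
```

Under this definition a value of 0 means the retrieved set matches the target distribution, and larger values mean one group is over-represented.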

5. Significance and Future Work

Trustworthy AI: DEBIASLENS transforms bias mitigation from a "black-box" correction into a transparent, auditable intervention. This is crucial for deploying AI in high-stakes domains like assistive technologies.
Efficiency: Unlike fine-tuning, which is computationally expensive and risks catastrophic forgetting, DEBIASLENS is a lightweight, inference-time intervention.
Limitations: The method relies on the quality of the SAE training data (currently limited to facial attributes). It assumes biases can be cleanly disentangled into single neurons, which may not hold for complex intersectional biases (e.g., race + gender + age).
Future Directions: The authors suggest expanding to more diverse, culturally specific datasets and exploring hierarchical SAEs to handle complex intersectional biases.

In summary, DEBIASLENS offers a principled, interpretable, and effective solution to social bias in Vision-Language Models by leveraging Sparse Autoencoders to isolate and neutralize specific "social neurons" without compromising the model's core intelligence.