Imagine you have a very smart, super-observant robot named CLIP. This robot has read millions of books and looked at billions of photos on the internet. It's so good at its job that if you show it a picture of a person, it can guess their profession (like "Doctor" or "Nurse") just by looking at them.
But there's a problem. Because it learned from the internet, it also learned our human prejudices. If you show it a picture of a female doctor, it often guesses "Nurse" instead. If you show it a young person, it might guess "Student," while an older person might get labeled "Retired."
The Big Mystery:
We knew the robot was biased, but we didn't know where inside its brain the bias was hiding. Was it in the part that recognizes faces? The part that understands clothes? Or the part that decides the final answer?
This paper is like a detective story where the authors try to find the exact "neuron" (or in this case, the "attention head") responsible for these unfair guesses.
The Detective Tools: How They Found the Culprit
The authors used three clever tools to investigate the robot's brain:
The "Residual Stream" (The Brain's Highway):
Think of the robot's brain as a busy highway where information flows. At every exit ramp (called a "layer"), there are 16 different lanes (called "attention heads"). Each lane is like a specialized worker. One lane might be looking for "shiny things," another for "blue things," and another for "people with stethoscopes."
The authors realized they could stop the traffic in specific lanes to see what happens.The "Zero-Shot Concept Detector" (The Lie Detector):
Usually, to test if a lane is biased, you'd need to train a new robot to check it. But these authors were clever. They used the robot's own language skills. They asked the robot: "Does this lane care more about the word 'Male' or the word 'Doctor'?"
If a lane cares more about "Male" than "Doctor" when looking at a picture of a doctor, that lane is likely the one causing the bias. It's like finding a worker who is more interested in the person's gender than their job title.The "Bias Dictionary" (The Expanded Vocabulary):
They gave the robot a special dictionary that included not just visual words (like "red," "round," "car") but also demographic words (like "Man," "Woman," "Young," "Old"). They forced the robot to compare every lane against these words. If a lane kept shouting "Woman!" when looking at a female doctor, they flagged it as a suspect.
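To make the lie detector and the dictionary concrete, here is a minimal sketch in plain Python/NumPy. Everything in it is an assumption for illustration, not the paper's actual code: the shapes, the random stand-in activations, and the tiny vocabulary are all made up. In the real method, the word vectors would come from CLIP's text encoder, and the per-lane vectors would come from the model's residual stream.

```python
import numpy as np

# A self-contained sketch of the "lie detector" idea (illustrative only).
# Assume we've run a CLIP-like model on one image and cached, per lane:
#   head_contribs[l, h] = the vector that attention head h in layer l
#                         adds to the residual stream ("highway"),
#                         projected into CLIP's shared image-text space.
# We fake those activations with random numbers just to show the logic.

rng = np.random.default_rng(0)
n_layers, n_heads, dim = 24, 16, 512          # plausible ViT-L-ish sizes
head_contribs = rng.normal(size=(n_layers, n_heads, dim))

# The "bias dictionary": visual/task words plus demographic words.
# In the real method these would be CLIP text embeddings; here we use
# random unit vectors as stand-ins.
vocab = ["doctor", "nurse", "stethoscope", "hospital",   # task words
         "man", "woman", "young person", "old person"]   # demographic words
demographic = {"man", "woman", "young person", "old person"}
text_embeds = rng.normal(size=(len(vocab), dim))
text_embeds /= np.linalg.norm(text_embeds, axis=1, keepdims=True)

def top_concept(contrib):
    """Return the dictionary word whose embedding best matches this
    lane's contribution (cosine similarity = the 'zero-shot' test)."""
    contrib = contrib / np.linalg.norm(contrib)
    sims = text_embeds @ contrib
    return vocab[int(np.argmax(sims))], float(np.max(sims))

# Flag any lane whose strongest concept is demographic, not visual:
for l in range(n_layers):
    for h in range(n_heads):
        word, score = top_concept(head_contribs[l, h])
        if word in demographic:
            print(f"suspect: L{l}H{h} cares most about '{word}' ({score:.2f})")
```

The key design choice to notice: no new robot is trained. The whole test is a dot product against word embeddings the robot can already produce on its own.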
The Investigation Results
They tested this on 42 different jobs. Here is what they found:
1. The Gender Bias: A Single Bad Apple
When looking at gender bias (Male vs. Female), they found that the bias wasn't spread out everywhere. It was concentrated in just four specific lanes near the very end of the robot's brain.
- The Smoking Gun: One specific lane, named L23H4 (short for "Layer 23, Head 4," or in our highway analogy, lane 4 at exit ramp 23), was responsible for almost all the trouble.
- The Experiment: They "muted" this specific lane (turned it off); a code sketch of this muting follows the list below.
- The Result: Suddenly, the robot got much better at guessing "Doctor" for women! The bias dropped, and the robot actually became smarter overall.
- The Catch: It's like fixing a leaky pipe. When they stopped the leak into the "Nurse" bucket, the water finally started flowing into the "Doctor" bucket where it belonged. The robot didn't become perfectly neutral; it just stopped making that specific unfair mistake.
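Here is a rough sketch of what "muting" a lane means in code, under the simplifying assumption (common in residual-stream analysis) that the final image embedding is roughly a sum of per-lane contributions plus everything else. All names, shapes, and data below are made up for illustration; this is not the paper's implementation.

```python
import numpy as np

# A minimal sketch of "muting" one lane (head ablation), with fake data.
rng = np.random.default_rng(1)
n_layers, n_heads, dim = 24, 16, 512
head_contribs = rng.normal(size=(n_layers, n_heads, dim))  # fake activations
other_terms = rng.normal(size=dim)  # MLPs, embeddings, etc. (also fake)

def image_embedding(contribs):
    # The highway at the end of the trip: sum of all lanes + the rest.
    return contribs.sum(axis=(0, 1)) + other_terms

def mute(contribs, layer, head):
    muted = contribs.copy()
    muted[layer, head] = 0.0  # zero-ablation; a mean over a dataset
    return muted              # is another common choice

def classify(embed, class_embeds, labels):
    embed = embed / np.linalg.norm(embed)
    return labels[int(np.argmax(class_embeds @ embed))]

labels = ["doctor", "nurse"]
class_embeds = rng.normal(size=(2, dim))
class_embeds /= np.linalg.norm(class_embeds, axis=1, keepdims=True)

before = classify(image_embedding(head_contribs), class_embeds, labels)
after = classify(image_embedding(mute(head_contribs, 23, 4)),
                 class_embeds, labels)
print("prediction before muting L23H4:", before)
print("prediction after  muting L23H4:", after)
```

Because the embedding is just a sum, knocking one lane out requires no retraining at all: you zero its term, re-add the rest, and see whether the guess flips from "Nurse" back to "Doctor."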
2. The Age Bias: A Foggy Mess
When they looked at age bias (Young vs. Old), the story was different.
- They found some suspect lanes, but when they muted them, nothing changed much.
- The Analogy: Gender bias was like a single, loud alarm bell ringing in the wrong room. Age bias was like a fog that covered the whole building. You can't just turn off one switch to clear the fog; the information is scattered everywhere, making it much harder to fix with this method.
The "Aha!" Moment
The most important discovery is that some biases live in specific, tiny parts of the AI's brain, while others are smeared across the whole thing.
- For Gender: It's like a specific switch that says, "If the person looks like a woman, guess Nurse." The authors found that switch and turned it off.
- For Age: It's like a general haze. The robot doesn't have one switch for "Old people"; it just has a general vibe that gets mixed into many different decisions.
Why This Matters
This paper is a breakthrough because it moves us from saying "AI is biased" to saying "Here is exactly where the bias is."
- Before: We knew the car was driving off the road, but we didn't know if it was the steering wheel, the tires, or the engine.
- Now: We know it's the steering wheel, and we know exactly which bolt is loose.
However, the authors warn us: Turning off the bias switch isn't a magic cure. If you turn off the "Gender Bias" switch, the robot might start making different mistakes. It's like fixing a leak in one part of a boat; you might stop the water from coming in there, but the boat still needs a full repair to be truly safe.
In short: The authors built a microscope to see exactly where AI gets unfair. They found that for gender, the problem is small and fixable. For age, the problem is deep and messy. This gives us a roadmap for how to build fairer AI in the future.