The Big Problem: The "Smart" Student Who Cheats
Imagine you hire a student to take a history test. You want them to learn the actual dates and events. But you accidentally give them a practice test where every question about "World War II" is printed in red ink, and every question about "The Renaissance" is in blue ink.
The student studies hard and gets 100% on the practice test. You think, "Great! They are a history genius!"
But then, you give them the real test, where the ink colors are random. The student fails miserably. Why? Because they didn't learn history; they learned to look for red ink. They were "cheating" by relying on a shortcut (a bias) rather than the actual subject matter.
In the world of Artificial Intelligence (AI), this is what happens under a Covariate Shift: the data changes between training and deployment (here, the ink colors), and any shortcut the AI quietly learned stops working. The trick that worked in the lab fails in the real world. The problem is, we often don't know what trick the AI was using until it's too late.
The Old Solution: The "Flashlight" (Saliency Maps)
Traditionally, to see what an AI is looking at, we use something called a Saliency Map. Think of this like shining a flashlight on an image to see which pixels are "glowing" the most.
- The Flaw: If the AI is looking at a red ink spot and the actual text in the same place, the flashlight just shows one big, blurry glow over that region. It can't tell you whether the AI is reading the words or just reacting to the color. It's like trying to figure out whether a chef is tasting the salt or the pepper when both are sprinkled on the same spot.
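For concreteness, here is what that "flashlight" looks like in code: a vanilla gradient saliency map, the simplest member of this family (the paper may use a different variant). A minimal sketch in PyTorch, assuming `model` is a trained classifier and `image` is one preprocessed input tensor:

```python
import torch

# Assumed inputs: `model` (a trained classifier) and `image`, a (C, H, W)
# tensor already preprocessed for the model.
image = image.clone().detach().requires_grad_(True)  # track pixel gradients
logits = model(image.unsqueeze(0))                   # forward pass, batch of 1
logits[0, logits.argmax()].backward()                # backprop the top score

# Pixels with the largest gradient "glow" the most. Note the limitation the
# text describes: the map marks WHERE the model looked, not WHAT (color vs.
# shape) it used there.
saliency = image.grad.abs().max(dim=0).values        # collapse color channels
```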
The New Solution: The "Translator" (Caption-Driven XAI)
The authors of this paper propose a new method called Caption-Driven Explainability. Instead of just shining a flashlight, they use a "Translator" to ask the AI what it's thinking.
Here is how they do it, step-by-step:
1. The Setup: Two Different Brains
- Brain A (The Standalone Model): This is the AI we are testing (like our "cheating" student). It's good at recognizing numbers (5s and 8s) but might be biased.
- Brain B (CLIP): This is a super-smart AI that has read millions of books and seen millions of pictures. It understands the connection between words (like "red," "green," "circle," "square") and images.
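A minimal sketch of the two "brains" in PyTorch, assuming OpenAI's `clip` package for Brain B; the small convolutional classifier standing in for Brain A is illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git

# Brain A: a small standalone digit classifier (possibly biased).
class StandaloneModel(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # The feature extractor: the "eyes" we will transplant later.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)  # for 28x28 input

    def forward(self, x):
        return self.classifier(self.features(x))

brain_a = StandaloneModel()

# Brain B: CLIP, which maps images and text into one shared embedding space.
device = "cuda" if torch.cuda.is_available() else "cpu"
brain_b, preprocess = clip.load("ViT-B/32", device=device)
```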
2. The Surgery: Swapping the Brains
The researchers perform a digital "brain surgery." They take the visual front end of Brain A (the layers that actually look at the image) and plug it into Brain B in place of CLIP's own image encoder.
- The Analogy: Imagine taking the eyes of our "cheating student" and plugging them into the head of the "super-smart translator."
- Now, the super-smart translator is looking at the image through the eyes of the cheating student.
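One plausible way to implement the surgery, sketched under an explicit assumption: the transplant is a learned linear bridge that maps Brain A's features into CLIP's 512-dimensional embedding space (the `HybridModel` name and the fitting procedure are illustrative; the paper's exact alignment step may differ):

```python
import torch
import torch.nn as nn

class HybridModel(nn.Module):
    """CLIP's "head" looking through the standalone model's "eyes"."""

    def __init__(self, features: nn.Module, feat_dim: int, clip_dim: int = 512):
        super().__init__()
        self.features = features                      # Brain A's eyes
        self.project = nn.Linear(feat_dim, clip_dim)  # bridge into CLIP space

    def encode_image(self, x: torch.Tensor) -> torch.Tensor:
        emb = self.project(self.features(x))
        return emb / emb.norm(dim=-1, keepdim=True)   # CLIP-style unit norm

# Reusing `brain_a` from the sketch above (32 * 7 * 7 = 1568 features).
hybrid = HybridModel(brain_a.features, feat_dim=32 * 7 * 7)
# The projection would then be fitted so these embeddings land in the same
# space as CLIP's text embeddings, e.g. by regressing onto CLIP's own image
# embeddings for the same inputs.
```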
3. The Test: Asking the Translator
Now, they show the image to this hybrid system and ask it to guess what it sees using specific captions (descriptions). They ask:
- "Is this a red digit?"
- "Is this a green digit?"
- "Is this a circle?"
- "Is this a square?"
Because the "eyes" belong to the cheating student, the translator will get very excited about the color if the student is biased. If the student was actually looking at the shape, the translator would get excited about the shape.
The Results: Catching the Cheat
In their experiment, they used a dataset where all the "5s" were red and all the "8s" were green.
- Before the fix: The "Translator" (using the cheating student's eyes) screamed "RED!" and "GREEN!" It ignored the shapes entirely. The method successfully exposed the problem: this AI is a cheater; it only looks at color.
- The Fix: The researchers removed the color from the images (turned them black and white) and retrained the student.
- After the fix: They did the surgery again. This time, the "Translator" got excited about the shape captions instead of the colors, showing the retrained student was finally reading the digits' actual forms.
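For readers who want to recreate the setup, a color-biased digit set like the one described is easy to build. A sketch assuming MNIST via torchvision; the `colorize` helper and the red-5/green-8 assignment are illustrative, matching the description above rather than the paper's code:

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST(root="data", train=True, download=True,
                       transform=transforms.ToTensor())

def colorize(img: torch.Tensor, label: int, biased: bool = True) -> torch.Tensor:
    """Return a 3-channel digit: red 5s and green 8s when biased."""
    rgb = torch.zeros(3, 28, 28)
    if biased:
        rgb[0 if label == 5 else 1] = img[0]  # channel 0 = red, 1 = green
    else:
        rgb[:] = img[0]                       # grayscale copy: the "fix"
    return rgb

# Biased training set (color perfectly predicts the class) ...
biased_set = [(colorize(x, y, biased=True), y) for x, y in mnist if y in (5, 8)]
# ... and the de-biased version used to retrain the "student".
fixed_set = [(colorize(x, y, biased=False), y) for x, y in mnist if y in (5, 8)]
```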
Why This Matters
This method is like a lie detector test for AI.
- Old way: You see the AI is confused, but you don't know why.
- New way: You can ask the AI, "Are you looking at the color or the shape?" and get a clear answer.
If you are building an AI for a hospital to diagnose diseases, you don't want it to be a "cheating student" that only looks at the color of the X-ray film to make a diagnosis. You want it to look at the actual bone or organ. This new method helps doctors and engineers catch these "cheating" AI models before they are deployed, ensuring they are robust, fair, and actually looking at the right things.
In short: They built a way to translate an AI's "visual thoughts" into human language, revealing whether the AI is smart or just relying on a lucky shortcut.