GIFT: A Framework Towards Global Interpretable Faithful Textual Explanations of Vision Classifiers

The paper introduces GIFT, a post-hoc framework that generates global, interpretable, and faithful textual explanations for vision classifiers by aggregating natural-language descriptions of local visual counterfactuals and verifying their causal impact through image-based interventions.

Éloi Zablocki, Valentin Gerard, Amaia Cardiel, Eric Gaussier, Matthieu Cord, Eduardo Valle

Published 2026-02-23

Imagine you have a super-smart robot that looks at pictures and decides what they are. Maybe it tells you if a face looks "Old" or "Young," or if a car in a video can "Turn Right" safely. But here's the problem: the robot is a black box. It gives you an answer, but it won't tell you why. It's like a chef who makes a delicious soup but refuses to tell you the recipe.

The paper introduces a new tool called GIFT (Global, Interpretable, Faithful, Textual explanations). Think of GIFT as a detective that interrogates the robot to find out its secret recipe, but it does so in a way that is honest, clear, and written in plain English.

Here is how GIFT works, broken down into four simple steps using a detective analogy:

The Detective's Four-Step Investigation

Step 1: The "What If?" Game (Counterfactuals)
Imagine you show the robot a picture of a person with glasses and it says, "This person is Old."
The detective (GIFT) asks the robot: "What if I took the glasses away? Would you still say they are Old?"
The robot tries to imagine the picture without glasses. If the robot suddenly changes its mind and says, "Actually, they look Young now," the detective knows: "Aha! The glasses are a clue!"

  • The Magic: GIFT does this thousands of times, changing tiny things in thousands of pictures (adding a red ball, removing a car, changing a color) to see what makes the robot change its mind. These are called counterfactuals.
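Here is a minimal Python sketch of that loop. Everything in it is a toy stand-in: `classify` plays the black-box robot and `edit_image` plays the counterfactual generator (in the paper, both are trained models, and the edits are semantic, not random noise).

```python
import numpy as np

def classify(image: np.ndarray) -> str:
    """Toy black-box classifier: calls bright images 'Old'."""
    return "Old" if image.mean() > 0.5 else "Young"

def edit_image(image: np.ndarray, edit: str, rng: np.random.Generator) -> np.ndarray:
    """Toy stand-in for a semantic edit such as 'remove glasses'.
    (The toy ignores the edit text and just perturbs the pixels.)"""
    return np.clip(image + rng.normal(0.0, 0.3, image.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(100)]
edits = ["remove glasses", "add wrinkles", "blur background"]

# Keep every (before, after) pair where a single edit flips the decision:
# these flips are the counterfactual "clues" GIFT collects.
counterfactuals = []
for img in images:
    before = classify(img)
    for edit in edits:
        modified = edit_image(img, edit, rng)
        after = classify(modified)
        if after != before:
            counterfactuals.append((img, modified, edit, before, after))

print(f"found {len(counterfactuals)} decision flips")
```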

Step 2: Translating the Clues (Vision-to-Text)
These counterfactual changes are just pixels. It's hard for humans to look at two nearly identical pictures and say exactly what changed.
So, GIFT brings in a translator (a Vision-Language Model). This translator looks at the "Before" and "After" pictures and writes a sentence.

  • Instead of: "Pixel values in the top-left quadrant shifted by 15%."
  • The Translator says: "The tiny red metal ball behind the brown block was removed."
    Now, the clues are in words we can understand.
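A sketch of that translation step, assuming a hypothetical `describe_change` helper backed by any off-the-shelf Vision-Language Model; the prompt wording and the canned answer below are illustrative, not the paper's.

```python
def describe_change(before_path: str, after_path: str) -> str:
    """Ask a VLM what changed between two images (placeholder body)."""
    prompt = (
        "You are shown a 'before' image and an 'after' image. In one short "
        "sentence, state what object or attribute was added, removed, or changed."
    )
    # A real version would encode both images and send them, with the
    # prompt above, to a vision-language model. Canned answer for the demo:
    return "The tiny red metal ball behind the brown block was removed."

# Each counterfactual pair becomes one plain-English note tying the edit
# to the classifier's change of mind.
pairs = [("scene_before.png", "scene_after.png", "class A", "class B")]
notes = [
    f"{describe_change(b, a)} -> prediction changed from {y0} to {y1}"
    for b, a, y0, y1 in pairs
]
print(notes[0])
```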

Step 3: Finding the Pattern (The "Aha!" Moment)
The detective now has a huge pile of notes: "Removed glasses -> became Young," "Added wrinkles -> became Old," "Removed red ball -> changed class."
Individually, these notes are messy. But GIFT uses a super-smart brain (a Large Language Model) to read all the notes and find the pattern.

  • It realizes: "Wait, every time the robot changes its mind, it's because of glasses or wrinkles."
  • It drafts a global rule: "This robot thinks people are Old if they have glasses or forehead wrinkles."
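A sketch of the aggregation step, assuming a generic `llm` helper that wraps any large language model; the prompt and the canned reply are placeholders showing the shape of the step, not the paper's actual prompt.

```python
def llm(prompt: str) -> str:
    """Placeholder LLM call; a real version would query an actual model."""
    return ("Global rule: the classifier predicts 'Old' when the face "
            "shows glasses or forehead wrinkles.")

notes = [
    "The glasses were removed. -> prediction changed from Old to Young",
    "Wrinkles were added to the forehead. -> prediction changed from Young to Old",
    "The hair color was changed. -> prediction unchanged",
]

prompt = (
    "Each line below records one image edit and whether it changed a "
    "vision classifier's prediction. Summarize them into a single global "
    "rule describing what the classifier relies on.\n\n" + "\n".join(notes)
)
hypothesis = llm(prompt)
print(hypothesis)
```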

Step 4: The Lie Detector Test (Verification)
Here is the most important part. Sometimes, the super-smart brain might guess wrong or make things up. GIFT doesn't just trust the guess; it tests it.
The detective goes back to the robot and says: "I think your rule is 'Glasses = Old'. Let's test it."
GIFT uses an image editor to take a picture of a young person, add glasses, and show it to the robot.

  • If the robot says "Old": The rule is TRUE. The glasses really caused the change.
  • If the robot says "Young": The rule is FALSE. The robot was just confused, and the glasses didn't matter.
    This step ensures the explanation is Faithful—it's not just a guess; it's a proven fact.
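A sketch of that test, with `add_attribute` standing in for a text-guided image editor and `classify` for the black-box model. Both are hypothetical placeholders, and judging the rule by a flip rate is our illustration of the idea, not a number from the paper.

```python
def add_attribute(image: dict, attribute: str) -> dict:
    """Placeholder: return the image edited to contain `attribute`."""
    return {**image, attribute: True}

def classify(image: dict) -> str:
    """Placeholder classifier that (secretly) keys on glasses."""
    return "Old" if image.get("glasses") else "Young"

# Hypothesis under test: "glasses -> Old". Take images currently labeled
# 'Young', add glasses, and count how many predictions flip.
young_images = [{"glasses": False} for _ in range(50)]
flips = sum(classify(add_attribute(img, "glasses")) == "Old" for img in young_images)
flip_rate = flips / len(young_images)

# A high flip rate means the intervention causally moves the prediction,
# so the rule passes the lie-detector test; a low one refutes it.
print(f"'add glasses' flipped {flip_rate:.0%} of Young images to Old")
```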

Why is this a big deal?

Most other methods are like asking a human to guess the recipe by looking at the soup. They might say, "It probably has salt," but they could be wrong.

  • Old methods often give vague heatmaps (like a blurry red spot on a picture) that don't tell you what the object is.
  • GIFT gives you a clear sentence: "The robot is biased because it thinks cars on the left side of the road mean 'Don't Turn Right'."

Real-World Examples from the Paper

  1. The "Red Metal Ball" Test: In a toy world with blocks, GIFT figured out that a robot was trained to look for "Red Metal Objects." It didn't just say "Red"; it figured out it needed to be metal too.
  2. The "Old Face" Test: On a dataset of human faces, GIFT found that the robot was looking for wrinkles and glasses. But it also found a weird bias: the robot thought a "detailed background" made a person look old! (Maybe the training photos of old people were taken in busy rooms).
  3. The "Self-Driving Car" Test: This is the most critical one. The researchers tested a robot trained to decide if a car can turn right. They found the robot had a hidden bias: if there were cars in the left lane, the robot said "No Turn," even if the road was clear. GIFT found this hidden rule and proved it with the Lie Detector test. Without GIFT, this dangerous bias might have gone unnoticed.

The Bottom Line

GIFT is a framework that turns a black-box robot into a transparent partner.
It doesn't just guess why the robot made a decision; it plays "What If," translates the results into English, finds the big patterns, and then proves those patterns are true by editing the images and testing the robot again.
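Put together, the whole investigation is one loop. The compact sketch below chains the four steps, with every component stubbed out by a trivial lambda; all names are hypothetical, and this mirrors the shape of GIFT, not the authors' code.

```python
def gift(classifier, images, perturb, vlm, llm, editor, test_images):
    # Step 1: counterfactuals -- edits that flip the classifier's decision.
    pairs = [(x, perturb(x)) for x in images if classifier(perturb(x)) != classifier(x)]
    # Step 2: translate each pixel-level change into a sentence.
    notes = [vlm(before, after) for before, after in pairs]
    # Step 3: distill the notes into one candidate global rule.
    rule = llm("Find the pattern in these observations:\n" + "\n".join(notes))
    # Step 4: intervene according to the rule and measure how often the
    # prediction actually moves; the rule is kept only if it does.
    flips = sum(classifier(editor(x, rule)) != classifier(x) for x in test_images)
    return rule, flips / max(len(test_images), 1)

# Tiny demo with trivial stubs in place of the real models:
rule, rate = gift(
    classifier=lambda x: "A" if x > 0.5 else "B",
    images=[0.4, 0.6],
    perturb=lambda x: 1.0 - x,
    vlm=lambda before, after: f"value moved from {before} to {after}",
    llm=lambda prompt: "Rule: the classifier thresholds the value at 0.5.",
    editor=lambda x, rule: 1.0 - x,
    test_images=[0.2, 0.9],
)
print(rule, f"(verified on {rate:.0%} of interventions)")
```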

It's like having a detective who doesn't just tell you who the culprit is, but shows you the fingerprint, the motive, and the proof, all written in plain English. This makes AI safer and more trustworthy for things like self-driving cars and medical diagnosis.
