Why Does It Look There? Structured Explanations for Image Classification

The paper proposes I2X, a framework that transforms unstructured interpretability into structured, prototype-based explanations to reveal model decision-making processes and actively improve classification accuracy through targeted sample perturbation.

Jiarui Li, Zixiang Yin, Samuel J Landry, Zhengming Ding, Ramgopal R. Mettu

Published 2026-03-12

Imagine you have a brilliant but silent student taking a test. They get the right answers almost every time, but when you ask, "How did you know that was a cat and not a dog?" they just shrug. They can't explain their thought process. This is the problem with most modern AI: it's a "black box." It works great, but we don't know why it works.

Existing methods try to peek inside by highlighting the parts of a picture the AI looked at (like a highlighter pen on a photo). But this is messy. It's like seeing a student underline words in a book but not knowing whether they underlined them because they were important, or just because they liked the color. It doesn't tell you the story of how the student learned.

This paper introduces a new method called I2X (Interpretability to Explainability). Think of I2X as a detective that interviews the student at every stage of their training to build a structured story of how they learned.

Here is how it works, using simple analogies:

1. The "Lego Brick" Analogy (Prototypes)

Instead of looking at the whole picture at once, I2X breaks the AI's learning down into tiny building blocks called prototypes.

  • Imagine the AI is learning to recognize the number "7".
  • It doesn't just see a "7." It learns to recognize specific "Lego bricks" that make up a 7: a diagonal line in the middle, a dot at the top, a horizontal line at the bottom.
  • I2X identifies these specific patterns (bricks) and names them. Let's call them "Brick A," "Brick B," and "Brick C."
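The "brick matching" idea can be sketched in a few lines. This is a simplified illustration, not the paper's actual implementation: it assumes the network produces patch embeddings, and scores each learned prototype by its best cosine-similarity match over the patches (the names `patch_features` and `prototypes` are hypothetical).

```python
import numpy as np

def prototype_activations(patch_features, prototypes):
    """Score how strongly each learned prototype ("brick") fires on an image.

    patch_features: (num_patches, d) array of patch embeddings from the network.
    prototypes: (num_prototypes, d) array of learned prototype vectors.
    Returns, for each prototype, its best cosine similarity over all patches.
    """
    # Normalize rows so the dot product below is cosine similarity.
    f = patch_features / np.linalg.norm(patch_features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = f @ p.T                 # (num_patches, num_prototypes)
    return sims.max(axis=0)        # best-matching patch per prototype

# Toy example: 3 image patches and 2 prototypes ("Brick A", "Brick B") in 4-d.
rng = np.random.default_rng(0)
patches = rng.normal(size=(3, 4))
bricks = rng.normal(size=(2, 4))
scores = prototype_activations(patches, bricks)
```

A high score for "Brick A" means somewhere in the image there is a patch that looks very much like that learned pattern.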

2. The "Training Diary" (Tracking Evolution)

Most AI explanations just look at the finished product. I2X is different; it keeps a diary of the AI's entire training journey.

  • It checks the AI at different checkpoints (like checking a student's progress at the end of every week).
  • It asks: "At week 1, did you use 'Brick A' to guess '7'? Did you get it right?"
  • "At week 4, did you start using 'Brick B'?"
  • By tracking these changes, I2X builds a timeline: "First, the AI learned to spot the diagonal line. Then, it learned to ignore the top dot because that confused it with the number '1'."
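The diary idea can be sketched as a loop over saved checkpoints. This is a hedged toy version, assuming each checkpoint exposes a prototype-to-class weight matrix (the `diary` data and function name are invented for illustration):

```python
def prototype_timeline(checkpoints, target_class):
    """Record which prototype dominates a class at each training checkpoint.

    checkpoints: list of (name, weights) pairs, where weights[c][k] is the
    strength connecting prototype k to class c at that checkpoint.
    Returns a list of (checkpoint_name, index_of_strongest_prototype).
    """
    timeline = []
    for name, weights in checkpoints:
        row = weights[target_class]
        timeline.append((name, max(range(len(row)), key=row.__getitem__)))
    return timeline

# Toy diary: at week 1, class "7" (index 1) leans on prototype 0 (the diagonal
# line); by week 4 it has shifted to prototype 2 (the bottom horizontal line).
diary = [
    ("week1", [[0.1, 0.0, 0.0], [0.9, 0.2, 0.1]]),
    ("week4", [[0.1, 0.0, 0.0], [0.3, 0.2, 0.8]]),
]
story = prototype_timeline(diary, target_class=1)
# story == [("week1", 0), ("week4", 2)]
```

Reading the timeline off is exactly the "first it learned the diagonal, then it switched" narrative described above.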

3. The "Confused Student" (Finding Uncertainty)

Sometimes, the AI gets confused. Maybe it thinks a "7" looks like a "2" because they both have a curve.

  • I2X spots the specific "Brick" that is causing the confusion. Let's say it's "Brick X," which looks like a curve found in both numbers.
  • The paper shows that if the AI sees "Brick X" too often without clear context, it gets shaky in its decisions. It's like a student who keeps flipping between two answers because one clue fits both.
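One standard way to quantify that "shakiness" is the entropy of the model's class scores; a sketch under that assumption (the specific logit values are made up to illustrate the effect):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def prediction_entropy(logits):
    """Shannon entropy of the softmax distribution: higher means more 'shaky'."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0)

# When "Brick X" (a curve shared by "7" and "2") fires, it adds evidence to the
# competing class too, narrowing the gap between the logits:
confused = prediction_entropy([2.0, 1.8])   # Brick X boosted the wrong class
confident = prediction_entropy([2.0, 0.1])  # only class-specific bricks fired
```

Here `confused` comes out larger than `confident`: the closer the two logits, the more the model flip-flops between the answers, matching the "one clue fits both" picture.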

4. The "Smart Tutor" (Fixing the AI)

This is the coolest part. Once I2X finds the "confusing brick," it doesn't just report it; it helps fix the AI.

  • The researchers took the AI and gave it a special "tutoring session."
  • They showed the AI examples that didn't have the confusing "Brick X," helping it learn to ignore that specific trap.
  • The Result: The AI became much better at telling the difference between the numbers (or cats and dogs in the case of the CIFAR-10 dataset). It reduced its mistakes significantly.
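The "tutoring" step can be sketched as a targeted perturbation: blank out the region where the confusing prototype fires, then fine-tune on the result. This is a minimal illustration, not the paper's exact procedure; the function name, threshold, and zero-fill choice are all assumptions.

```python
import numpy as np

def mask_confusing_region(image, activation_map, threshold=0.8):
    """Blank out the pixels where the confusing prototype ('Brick X') fires.

    image: (H, W) array of pixel values.
    activation_map: (H, W) similarity of Brick X at each location, in [0, 1].
    Pixels above `threshold` are zeroed, producing a 'tutoring' example
    that no longer contains the trap.
    """
    perturbed = image.copy()
    perturbed[activation_map > threshold] = 0.0
    return perturbed

# Toy 4x4 image where Brick X fires in the top-left corner.
img = np.ones((4, 4))
act = np.zeros((4, 4))
act[0, 0] = act[0, 1] = 0.95
clean = mask_confusing_region(img, act)
```

Fine-tuning on such perturbed samples forces the model to rely on the remaining, class-specific bricks instead of the ambiguous one.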

Why This Matters

Think of current AI as a magic trick. You see the rabbit appear, but you don't know how the magician did it.

  • Old methods say: "The magician looked at the rabbit's left ear." (This is the "saliency map" or highlighting).
  • I2X says: "The magician first practiced pulling the rabbit from the left sleeve, then realized the hat was too big, so he switched to the right sleeve, and finally learned to hide the rabbit in the cape. Here is the step-by-step manual."

The Big Takeaway

This paper gives us a way to turn the AI's "black box" into a transparent instruction manual. It shows us exactly how the AI organizes its thoughts, where it gets confused, and how we can gently nudge it to learn better. It's not just about explaining the past; it's about using that explanation to make the AI smarter and more reliable for the future.