Imagine you have a giant, high-resolution photograph of a city (a Whole Slide Image from a microscope). This photo is so huge it contains millions of tiny neighborhoods (called patches). Your goal is to predict something about the whole city, like "Is there a crime happening here?" or "How long will this city survive?"
To do this, you build a super-smart AI detective (a Multiple Instance Learning or MIL model). The AI looks at all the tiny neighborhoods, picks out the clues, and makes a prediction about the whole city.
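Under the hood, the most common MIL "detective" works by giving each neighborhood an attention score and averaging the evidence. Here is a minimal sketch of that idea (a simplified attention pooling in the spirit of ABMIL; the matrices `V` and `w` are made-up stand-ins for learned parameters, not the paper's actual model):

```python
import numpy as np

def attention_mil(patch_feats, V, w):
    """Toy attention-based MIL pooling: score each patch (neighborhood),
    softmax the scores into weights, and summarize the whole slide (city)
    as the attention-weighted average of its patches."""
    scores = np.tanh(patch_feats @ V) @ w        # one score per patch
    weights = np.exp(scores - scores.max())      # stable softmax...
    weights = weights / weights.sum()            # ...weights sum to 1
    slide_feat = weights @ patch_feats           # whole-slide summary
    return slide_feat, weights                   # weights = the "heatmap"
```

Those `weights` are what gets painted onto the slide as an attention heatmap, and they are exactly the maps this paper puts under suspicion.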
But here's the problem: How do we know the AI is actually looking at the crime scene and not just looking at a weird stain on the lens or a shadow?
To answer this, the AI draws a Heatmap. It paints the city in red where it thinks the clues are. Doctors and scientists have been using these heatmaps for years to trust the AI. But this paper asks a scary question: "What if the heatmap is lying?"
The Big Discovery: The "Fake Map" Problem
The authors found that the most popular way to draw these maps—called Attention Heatmaps—is often like a magician's trick. The AI might say, "I'm looking at the red house!" but actually, it's just guessing based on the color of the roof, not the crime itself. The map looks pretty, but it doesn't tell the truth about how the AI is thinking.
The Solution: A "Truth Test" for Maps
The researchers created a new Truth Test (called Patch Flipping) to see which map-drawing method is honest.
The Analogy:
Imagine you are trying to guess what's inside a sealed box by looking at it.
- The Old Way (Attention): You ask the AI, "What are you looking at?" and it points to a spot. You trust it blindly.
- The New Truth Test: You take the AI's map, find the spots it says are important, and physically remove them from the photo.
  - If the map is honest, removing those spots should make the AI completely confused and change its answer.
  - If the map is lying, removing the spots won't change the AI's answer at all. It means the AI was looking somewhere else entirely!
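In code, the Truth Test is just a loop: mask the patches the map claims are important, from most to least, and watch the prediction. A minimal sketch (the `predict` function and the patch format are placeholders, not the paper's exact protocol):

```python
import numpy as np

def patch_flip(predict, patches, importance):
    """Remove patches in order of claimed importance and record the
    model's output after each removal. An honest heatmap should make
    the prediction collapse quickly; a lying one barely moves it."""
    order = np.argsort(importance)[::-1]   # most "important" first
    masked = patches.copy()
    outputs = [predict(masked)]            # prediction on the full slide
    for idx in order:
        masked[idx] = 0.0                  # "remove" this neighborhood
        outputs.append(predict(masked))
    return outputs
```

Comparing how steeply this curve falls for each map-drawing method is the basic scoring idea: the faster the collapse, the more faithful the map.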
The Race: Who Draws the Best Map?
The authors ran a massive race with six different map-drawing methods across ten different medical tasks (like detecting cancer, predicting survival, or finding genetic mutations). They tested them on different types of AI brains (some based on Transformers, some on Attention, some on new Mamba tech).
The Results:
- The Losers: The famous Attention Heatmaps (the ones everyone uses) usually failed the test. They were often no better than a random guess. They looked nice but didn't reflect the AI's actual logic.
- The Winners: Three methods consistently drew the truth:
  - Single: A method that tests one neighborhood at a time.
  - LRP (Layer-wise Relevance Propagation): A method that traces the "relevance" of every clue back to the source, like following a breadcrumb trail.
  - IG (Integrated Gradients): A method that adds up how much the answer changes as you gradually fade the image in from a blank screen.
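Of the winners, Integrated Gradients is the easiest to write down: fade the input in from a blank baseline and add up the gradients along the way. A self-contained toy version using numerical gradients (real implementations use automatic differentiation; this sketch only shows the idea):

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=50, eps=1e-5):
    """Average the gradient of f along the straight line from a blank
    baseline to the real input x, then scale by (x - baseline)."""
    total = np.zeros_like(x)
    for alpha in np.linspace(0.0, 1.0, steps):
        point = baseline + alpha * (x - baseline)
        grad = np.zeros_like(x)
        for i in range(x.size):            # central finite differences
            bump = np.zeros_like(x)
            bump[i] = eps
            grad[i] = (f(point + bump) - f(point - bump)) / (2 * eps)
        total += grad
    return (x - baseline) * total / steps  # one attribution per feature
```

A handy sanity check is IG's completeness property: the attributions should add up to `f(x) - f(baseline)` (exactly for linear models, approximately otherwise).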
The Takeaway: If you want to know why an AI made a medical decision, don't just look at the "Attention" map. Use LRP or Single instead. They are the honest reporters.
Real-World Superpowers
Once the researchers found the "honest maps," they used them to do cool things that were impossible before:
1. The "X-Ray Vision" for Genes
They trained an AI to guess a patient's gene expression (like a chemical recipe inside the cells) just by looking at the tissue slide.
- The Magic: They used the honest heatmap to see where on the slide the AI was looking.
- The Proof: They compared this map to a real, expensive lab test called Spatial Transcriptomics (which actually measures genes in specific spots).
- The Result: The AI's "honest map" lined up closely with where the genes were actually active! This means cheap microscope slides could be used to "see" gene activity without running the expensive lab test every time.
2. Finding Hidden Clues for HPV
They looked at head and neck cancer slides to predict HPV infection.
- The Discovery: By using the honest maps, they found that the AI wasn't just looking at one thing. It had different strategies for different patients:
  - For some, it looked for heavy inflammation (immune cells).
  - For others, it looked at the shape of the tumor cells.
  - For a few, it found a pattern that human doctors missed entirely.
- The Impact: This helps doctors understand that the disease might look different in different people, and the AI can spot these subtle patterns.
The Bottom Line
This paper is like a "Consumer Reports" for AI in medicine. It tells us:
- Don't trust the flashy maps (Attention) just because they are popular.
- Test your maps using the "Truth Test" (Patch Flipping) to make sure they are honest.
- Use the honest methods (LRP, Single, IG) to unlock new biological discoveries and make medical AI safer and more reliable.
By switching to the right tools, we can stop guessing and start truly understanding what our AI doctors are thinking.