Imagine you have hired a very smart, but somewhat mysterious, robot doctor to look at X-rays and tell you if a patient has a specific illness. This robot is great at its job most of the time, but sometimes it makes mistakes. The scary part is that you don't know why it makes those mistakes, or who it is most likely to get wrong. Maybe it only fails when the X-ray is taken from a certain angle, or maybe it gets confused when a patient has a metal tube in their chest.
This paper introduces a new "Safety Inspector" for these robot doctors. Instead of just waiting for the robot to fail and hoping you notice, this inspector automatically hunts down the robot's weak spots and explains them in plain English.
Here is how it works, broken down with some everyday analogies:
1. The Problem: The "Black Box" and the "Hidden Patterns"
Usually, when we want to check if a robot doctor is fair, we look at its "ID card" (metadata) like age or gender. But what if the robot fails on a group that doesn't have an ID card? What if it fails specifically on "X-rays taken at night" or "X-rays of patients with a specific type of breathing machine"?
Traditional methods are like trying to find a needle in a haystack by only looking at the color of the straw. They miss the needle if it's a different color. This paper says: "Let's look at the whole haystack, not just the straw."
2. The Solution: The "Multimodal Detective"
The authors built a framework that acts like a super-detective. This detective doesn't just look at the X-ray picture (the image); it also reads the doctor's notes (the text report) and checks the patient's file (the metadata).
- The Old Way: The detective only looked at the picture.
- The New Way: The detective looks at the picture, reads the notes, and checks the file all at once. This is called Multimodal.
Think of it like trying to understand a joke. If you only read the text, you might miss the punchline. If you only hear the tone of voice, you might miss the words. But if you have both the text and the tone, you get the full picture. This detective does the same for medical data.
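The "all at once" idea above can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: the function name and the fixed embedding sizes are made up, and real systems would use learned encoders for each modality. The core move is the same, though: turn each modality into a vector, then combine them into one representation per patient.

```python
import numpy as np

def fuse_modalities(image_emb, text_emb, metadata_vec):
    """Combine per-case features from three sources into one vector.

    Each input is L2-normalized first so that no single modality dominates
    just because its raw numbers happen to be on a larger scale
    (e.g. a patient's age vs. a unit-scaled image embedding).
    """
    parts = []
    for v in (image_emb, text_emb, metadata_vec):
        v = np.asarray(v, dtype=float)
        norm = np.linalg.norm(v)
        parts.append(v / norm if norm > 0 else v)
    # Simple concatenation: the fused vector carries all three "views".
    return np.concatenate(parts)

# Toy example: 4-dim image embedding, 3-dim text embedding, 2 metadata fields.
fused = fuse_modalities([0.2, 0.1, 0.9, 0.4], [0.7, 0.3, 0.5], [63.0, 1.0])
print(fused.shape)  # (9,)
```

Concatenation is the simplest possible fusion; the point is only that the downstream "detective" now sees the picture, the notes, and the file in a single vector.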
3. How It Finds the Mistakes (The "Slice Discovery")
The detective uses a clever trick called Slice Discovery. Imagine you have a giant bag of mixed jellybeans. Some are red, some are blue, some are green. The robot doctor usually picks the right flavor, but sometimes it picks the wrong one.
The detective doesn't just count the wrong picks. It looks for clusters (slices) of jellybeans where the robot consistently gets it wrong.
- Example: "Oh, look! Every time the robot sees a jellybean that is Red AND has a Blue wrapper, it gets confused!"
- That combination (Red + Blue wrapper) is the "Error Slice."
The paper's innovation is that this detective can find these slices even if the "wrapper" isn't written in a database. It can figure out that the "Blue wrapper" is actually a specific type of medical device mentioned in the text report, even if the image just looks like a blob.
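The jellybean hunt above boils down to: group similar cases together, then check which groups have an unusually high error rate. Here is a deliberately tiny sketch of that idea, not the paper's method: it uses a bare-bones k-means with a toy deterministic initialization, and the function name, the `min_lift` threshold, and the cluster count are all illustrative assumptions.

```python
import numpy as np

def find_error_slices(embeddings, is_error, n_clusters=2, n_iter=20, min_lift=2.0):
    """Toy slice discovery: cluster cases in a shared embedding space,
    then flag clusters whose error rate is well above the overall rate."""
    X = np.asarray(embeddings, dtype=float)
    errs = np.asarray(is_error, dtype=float)
    # Toy deterministic init: use the first n_clusters points as centers.
    centers = X[:n_clusters].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each case to its nearest center, then recompute centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    base_rate = errs.mean()
    flagged = []
    for k in range(n_clusters):
        mask = labels == k
        rate = errs[mask].mean() if mask.any() else 0.0
        # A slice is "interesting" if its error rate is, say, 2x the average.
        if base_rate > 0 and rate / base_rate >= min_lift:
            flagged.append((k, float(rate), int(mask.sum())))
    return flagged
```

The key design choice is that clustering happens in the embedding space (which can be the fused image + text + metadata representation), so a slice can correspond to a pattern like "Red + Blue wrapper" even when no database column names it.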
4. How It Explains the Mistakes (The "Why")
Once the detective finds the bad group, it needs to explain why to the humans. It uses a method called Token Analysis.
Imagine the robot is confused by a specific word. The detective scans all the reports where the robot failed and asks: "What word shows up way more often in the 'failed' pile than in the 'successful' pile?"
- If the robot keeps failing on patients with a "chest tube," the detective will highlight the word "tube" as the culprit.
- It then double-checks this by looking at the actual X-rays to make sure the word "tube" actually matches the picture.
This turns a confusing math error into a clear sentence: "The robot is failing because it gets confused by patients with chest tubes."
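The "failed pile vs. successful pile" comparison is easy to make concrete. The sketch below is an illustrative stand-in, not the paper's exact token analysis: it scores each word by how over-represented it is in failed reports, with add-one smoothing so a word missing from one pile doesn't cause a division by zero. The function name and the scoring details are assumptions.

```python
from collections import Counter

def suspect_tokens(failed_reports, ok_reports, top_n=3, smoothing=1.0):
    """Rank words by how much more often they appear in failed cases.

    Score = (word frequency in failed pile) / (word frequency in ok pile),
    with add-one smoothing to avoid dividing by zero for unseen words.
    """
    def freqs(reports):
        counts = Counter(w for r in reports for w in r.lower().split())
        return counts, sum(counts.values())

    fail_counts, fail_total = freqs(failed_reports)
    ok_counts, ok_total = freqs(ok_reports)
    scores = {}
    for word in fail_counts:
        p_fail = (fail_counts[word] + smoothing) / (fail_total + smoothing)
        p_ok = (ok_counts.get(word, 0) + smoothing) / (ok_total + smoothing)
        scores[word] = p_fail / p_ok
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

failed = ["chest tube in place", "tube unchanged", "chest tube noted"]
ok = ["lungs clear", "no acute findings", "heart size normal"]
print(suspect_tokens(failed, ok, top_n=1))  # ['tube']
```

In this toy run, "tube" tops the list because it saturates the failed pile and never appears in the successful one, which is exactly the kind of signal a human reviewer can then verify against the actual X-rays.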
5. The Results: Why "More Info" is Better
The researchers tested this on a massive dataset of chest X-rays (MIMIC-CXR). They simulated three types of robot failures:
- Spurious Correlation: The robot thinks a "tube" means "sick" just because it saw them together a lot in training.
- Rare Slice: The robot saw very few "side-view" X-rays during training, so it fails on them.
- Noisy Labels: The training data had some wrong answers mixed in.
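To make the third failure mode concrete, here is one way such an experiment could be set up. This is a generic sketch of label-noise injection, not the researchers' actual protocol: the function name and the flip rate are illustrative.

```python
import random

def inject_label_noise(labels, flip_rate=0.1, seed=0):
    """Simulate 'noisy labels': randomly flip a fraction of binary labels,
    mimicking wrong answers mixed into the training data."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    noisy, flipped = [], 0
    for y in labels:
        if rng.random() < flip_rate:
            noisy.append(1 - y)  # 0 becomes 1, 1 becomes 0
            flipped += 1
        else:
            noisy.append(y)
    return noisy, flipped

clean = [0] * 1000
noisy, flipped = inject_label_noise(clean, flip_rate=0.1, seed=42)
print(flipped)  # roughly 100 of the 1000 labels get flipped
```

Controlled corruption like this is what lets researchers check whether a slice-discovery method can rediscover a failure they deliberately planted.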
The Big Takeaway:
- Multimodal wins: Using the picture + the text + the metadata was the best way to find the errors. It's like having a team of experts (a radiologist, a nurse, and a data clerk) working together instead of just one person.
- Text is powerful: Even without looking at the picture, just reading the text reports and metadata was surprisingly good at finding errors. This is huge because reading text is much cheaper and faster than processing complex images.
- Noisy data is hard: When the training data was very messy (lots of wrong answers), the detective struggled a bit, but still found patterns that older methods missed.
In a Nutshell
This paper gives us a smart, automated safety net for AI doctors. It doesn't just tell us that the AI is failing; it tells us exactly where (which group of patients) and why (what specific feature, like a "tube" or a "side view," is causing the confusion). By combining images, text, and data, it makes AI in healthcare safer, fairer, and easier to trust.