Imagine you hire a brilliant medical student to look at X-rays and MRI scans. This student has read every medical textbook in the library and can recite complex disease definitions from memory. They are a genius at what to look for.
But here's the problem: When you point to a specific spot on an X-ray and ask, "Is there a tumor here?", the student looks at the wrong part of the image entirely. They might stare at the edge of the frame or a random shadow, ignore the actual tumor, and confidently say, "No, everything looks fine."
This is exactly what researchers discovered about Medical Multimodal Large Language Models (MLLMs) in this new paper.
Here is the breakdown of the paper using simple analogies:
1. The Problem: The "Distracted Doctor"
For a long time, we thought these AI models failed because they didn't know enough medicine (like a student who hasn't studied). But this paper found that's not the main issue.
The real problem is Visual Grounding.
- The Analogy: Imagine a detective looking at a crime scene photo. A good detective looks at the muddy footprints near the window. A bad detective looks at the vase on the table and says, "The killer must have come through the window because the vase is blue."
- The Reality: The AI knows the words for "pneumonia" or "fracture," but when it tries to find them in the picture, its "eyes" (attention) wander off to the wrong spots. It's like a student who knows the answer but is looking at the wrong question on the test sheet.
The researchers found that while these AI models are great at looking at pictures of cats, dogs, and cars (natural scenes), they get completely lost when looking at medical scans. They fail to "ground" their answers in the actual relevant parts of the image.
2. The New Tool: "VGMED" (The Specialized Test)
To prove this, the researchers couldn't just use old tests. Old medical tests often asked questions like, "What kind of machine took this picture?" (which you can answer without looking closely at the organs).
So, they built a new test called VGMED.
- The Analogy: Think of this as a "spot the difference" game designed by real doctors. Instead of asking general questions, they showed the AI a specific box drawn around a liver or a lung and asked, "Is this specific spot swollen?" or "Is this specific spot dark?"
- The Result: They tested 8 of the smartest medical AI models on this new test. Almost all of them failed. They couldn't keep their eyes on the specific box the doctors pointed to. They were looking at the wrong things.
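The core measurement behind a test like this is simple to sketch: take the model's attention over the image patches and ask what fraction falls inside the expert-drawn box. The paper doesn't publish its scoring code here, so the function below is a hypothetical minimal sketch (names like `attention_in_box` are invented for illustration), assuming attention is available as a 2D map over a patch grid.

```python
import numpy as np

def attention_in_box(attn_map, box, image_size):
    """Fraction of visual attention that lands inside an expert-drawn box.

    attn_map:   2D array of attention weights over image patches
                (e.g. 24x24 patches for a 384px scan with 16px patches).
    box:        (x0, y0, x1, y1) region in pixel coordinates.
    image_size: (width, height) of the original scan.
    """
    h, w = attn_map.shape
    img_w, img_h = image_size
    x0, y0, x1, y1 = box
    # Map the pixel box onto the patch grid (round outward to cover it).
    c0 = int(np.floor(x0 / img_w * w))
    c1 = int(np.ceil(x1 / img_w * w))
    r0 = int(np.floor(y0 / img_h * h))
    r1 = int(np.ceil(y1 / img_h * h))
    total = attn_map.sum()
    inside = attn_map[r0:r1, c0:c1].sum()
    return float(inside / total) if total > 0 else 0.0

# Toy example: all attention sits in the top-left quadrant of the scan.
attn = np.zeros((24, 24))
attn[:12, :12] = 1.0
print(attention_in_box(attn, (0, 0, 192, 192), (384, 384)))  # → 1.0
```

A well-grounded model would score high when the question is about the boxed region; the failing models the paper describes would spread their attention elsewhere and score low.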
3. The Solution: "VGRefine" (The Attention Span Fix)
The researchers didn't just want to point out the problem; they wanted to fix it without retraining the AI (which is expensive and slow). They invented a method called VGRefine.
- The Analogy: Imagine the AI is a student taking a test, but they are easily distracted by the noise outside the window.
- Step 1 (Attention Triage): The researchers ask the AI, "Hey, which parts of you are actually looking at the right spot?" They find the specific attention heads (the model's internal "eyes") that are doing a good job of focusing on the relevant region.
- Step 2 (Attention Knockout): Then, they put a blindfold over the parts of the AI that are looking at the wrong stuff. They effectively say, "Stop looking at the vase; look only at the footprints."
- The Result: By simply telling the AI to ignore the distractions while it is thinking (at inference time), the AI suddenly gets much better at answering questions. It didn't need to learn new facts; it just needed to focus better.
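The two steps above can be sketched in code. This is not the paper's implementation, just a minimal toy illustration of the knockout idea on a standard multi-head attention layout: heads judged well-grounded in step 1 are kept, and every other head's attention is zeroed at inference time, so no retraining is involved. All names here (e.g. `masked_multihead_attention`, `keep_heads`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_multihead_attention(q, k, v, n_heads, keep_heads):
    """Toy multi-head attention with inference-time head knockout.

    q, k, v:    (seq_len, d_model) arrays, pre-split into equal head slices.
    keep_heads: indices of heads judged well-grounded (step 1);
                every other head is knocked out (step 2).
    """
    seq_len, d_model = q.shape
    d_head = d_model // n_heads
    out = np.zeros_like(q)
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, s] @ k[:, s].T / np.sqrt(d_head)
        attn = softmax(scores, axis=-1)
        if h not in keep_heads:
            # Knockout: a distracted head contributes nothing to the output.
            attn = np.zeros_like(attn)
        out[:, s] = attn @ v[:, s]
    return out

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(4, 8))
out = masked_multihead_attention(q, k, v, n_heads=2, keep_heads={0})
print(out[:, 4:])  # head 1 was knocked out, so its output slice is all zeros
```

In a real model the same effect is typically achieved by zeroing selected heads' attention weights during the forward pass (for instance via hooks), which is why the fix is cheap: it changes where the model looks, not what it knows.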
4. The Big Takeaway
This paper changes how we think about AI in medicine.
- Before: We thought AI failed because it wasn't "smart" enough or didn't have enough medical data.
- Now: We know that even the smartest models fail because they can't look at the right place. They have the knowledge, but they lack the focus.
In a nutshell: The paper says, "Don't just give the AI more textbooks. Teach it to look at the right spot on the X-ray." And they showed a simple way to do exactly that, making the AI significantly more reliable for doctors.