Imagine you are hiring a new medical resident to help diagnose patients. You give them a stack of X-rays and a list of questions like, "Is the liver enlarged?" or "Is there a tumor?"
You want them to look at the X-ray, analyze the image, and then give you the correct answer.
This paper is like a rigorous, somewhat shocking background check on a new generation of AI "residents." The researchers found that while these AI models are getting smarter at producing the right answers, they are getting worse at actually looking at the pictures.
Here is the breakdown of what happened, using some everyday analogies.
1. The "Cheat Sheet" Problem
The researchers tested AI models trained in two ways:
- Group A: Trained to look at both the X-ray and the text.
- Group B: Trained only on the text (the questions and answers), ignoring the images entirely.
The Shocking Result: Group B (the ones who never looked at the pictures) often got the same score, or even higher scores, than Group A.
The Analogy: Imagine a student taking a history test.
- Student A reads the textbook and studies the maps.
- Student B only memorizes the answer key and the specific phrasing of the questions.
- When the test comes, Student B gets a perfect score because they memorized that "Question 5 always equals 'The Battle of Hastings'." They didn't need to know why or look at a map.
The AI is doing the same thing. It realized that in medical tests, the words in the question often give away the answer. If the question asks, "Is the nodule spiculated?" the AI learns that "spiculated" usually means "cancer," so it just guesses "cancer" without actually looking at the jagged edges of the tumor in the image.
2. The "Blindfold" Test
To catch the cheaters, the researchers did a "stress test." They took the AI models and showed them three types of scenarios:
- Real: The correct X-ray and the question.
- Blank: The question, but the X-ray was replaced with a plain gray square (like a blank piece of paper).
- Shuffled: The question, but paired with a random X-ray from a different patient (e.g., a question about a liver is paired with a picture of a broken leg).
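For the curious, the three-condition stress test above can be sketched in a few lines of Python. Everything here is illustrative: `model` and its `predict(image, question)` method are hypothetical stand-ins for whatever vision-language model is being audited, not an interface from the paper.

```python
import random

def blindfold_test(model, dataset, blank_image):
    """Score a model under three image conditions: real, blank, shuffled.

    `model.predict(image, question)` is a hypothetical interface;
    `dataset` is a list of (image, question, answer) triples.
    Returns accuracy under each condition.
    """
    images = [img for img, _, _ in dataset]
    hits = {"real": 0, "blank": 0, "shuffled": 0}
    for image, question, answer in dataset:
        # Real: the correct X-ray paired with its question.
        hits["real"] += model.predict(image, question) == answer
        # Blank: a plain gray square instead of the X-ray.
        hits["blank"] += model.predict(blank_image, question) == answer
        # Shuffled: a random X-ray from a different patient.
        hits["shuffled"] += model.predict(random.choice(images), question) == answer
    n = len(dataset)
    return {condition: count / n for condition, count in hits.items()}
```

If the "blank" and "shuffled" accuracies come out close to the "real" one, the model is answering from text patterns rather than from the image.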
The Findings:
- The "Text-Only" AI: When shown a blank gray square, it still got the answer right 80% of the time. It was ignoring the image completely and just reading the question like a cheat sheet.
- The "Image-Text" AI: This was even worse. When shown a random, mismatched image (like a leg X-ray for a liver question), it often still gave the same answer as if it saw the correct liver. It was so focused on the text patterns that it didn't even notice the picture was wrong.
The Metaphor: It's like a driver who is so focused on the GPS voice saying "Turn Left" that they don't notice the road is actually a dead end, or that they are driving on the wrong side of the street. They follow the instruction blindly, ignoring reality.
3. The "Confident Liar" (Hallucination)
The most dangerous part of this discovery is how the AI explains its reasoning. The researchers asked the AI to "think out loud" before giving an answer.
The Scenario:
- Question: "Is the liver normal?" (Paired with a Chest X-ray, which doesn't show the liver well).
- The AI's Reasoning: "I see the liver is normal in size and shape..."
- The Reality: The AI is looking at a picture of a chest, not a liver. It is hallucinating.
The Analogy: Imagine a tour guide who has memorized a script about the Eiffel Tower. You take them to a random park in Ohio. They look at a tree and confidently say, "As you can see, the iron lattice structure of the Eiffel Tower is quite rusted today."
They are using all the right medical words ("size," "shape," "density"), but they are describing things that aren't there. The paper calls this "Hallucinated Visual Reasoning." The AI is mimicking the language of a doctor without doing the work of a doctor.
4. Why This Matters
The researchers call this a "Modality Paradox."
- The Goal: We want AI to be a super-doctor that looks at X-rays and finds diseases we might miss.
- The Reality: By training the AI to just "get the right answer" (Accuracy), we accidentally taught it to stop looking at the X-rays. It found a shortcut: "If I just read the question carefully, I can guess the answer without doing the hard work of looking at the picture."
The Bottom Line
The paper concludes that Accuracy is a trap. Just because an AI gets the right answer doesn't mean it actually understood the image.
To fix this, we need to change how we test and train these models:
- Stop rewarding just the final answer. We need to reward the AI for actually looking at the picture.
- Use "Blindfold" tests. If an AI can answer correctly without an image, it's cheating.
- Check the reasoning. If the AI says "I see a tumor," but the picture is blank, we need to catch that lie immediately.
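One simple way to operationalize the first two fixes is a gap metric: how much better does the model do with the real image than blindfolded? The function and its name below are my own illustration, not something proposed in the paper.

```python
def visual_dependency_score(acc_real, acc_blank):
    """Gap between accuracy on real images and on blank ones.

    A score near zero means the model answers just as well blindfolded,
    i.e. it is reading the question like a cheat sheet.
    (Illustrative metric, not taken from the paper.)
    """
    return acc_real - acc_blank
```

For example, a model that scores 0.82 with real X-rays but 0.80 with blank gray squares gets a score of only about 0.02: almost none of its accuracy actually depends on vision.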
In short: We are building AI that is getting better at passing the test, but worse at being a doctor. If we don't fix this, we risk deploying AI that confidently diagnoses patients based on text patterns rather than actual medical evidence.