Imagine you have a brilliant medical student who has read every textbook, memorized every scan, and can diagnose a liver tumor in a perfect, crystal-clear X-ray with 100% confidence. This student is like today's Multimodal Large Language Models (MLLMs)—AI systems that can "see" medical images and "talk" about them.
But here's the problem: Real hospitals aren't perfect.
In the real world, patients move, machines are old, and scans can be grainy, blurry, or noisy. When you hand this brilliant student a blurry, noisy photo of a liver, they might still say, "I'm 95% sure this is a tumor!" even if the blur makes it impossible to tell. They are confidently wrong.
This paper, MedQ-Deg, is like a "stress test" designed to find out exactly how these AI doctors handle bad-quality images and, more importantly, whether they know when they are struggling.
🏥 The Problem: The "AI Dunning-Kruger Effect"
The authors discovered a scary phenomenon they call the AI Dunning-Kruger Effect.
- In Human Psychology: This is when a person who isn't very good at something thinks they are a genius. They lack the self-awareness to know what they don't know.
- In AI: When the image quality gets worse (like a photo taken in the dark), the AI's accuracy drops like a stone. But, its confidence stays high. It keeps saying, "I'm sure!" even as it starts making dangerous mistakes.
The Analogy: Imagine a GPS navigation app. On a clear day, it guides you perfectly. But when you drive into a thick fog (image degradation), the GPS starts giving you wrong turns. The scary part? It doesn't say, "I can't see the road, please drive carefully." Instead, it keeps shouting, "Turn left now!" with the same loud, confident voice it used on a sunny day. That is the AI Dunning-Kruger Effect.
🛠️ The Solution: MedQ-Deg (The Stress Test)
To fix this, the researchers built MedQ-Deg, a massive new testing ground. Think of it as a "Gym for AI Doctors" with three special features:
- The "Dirty" Gym: They didn't just test the AI on perfect photos. They took 24,894 medical questions and intentionally "ruined" the images in 18 different ways (adding noise, blurring, motion artifacts, etc.) at three levels of severity:
  - Level 0: Perfect image.
  - Level 1: A little bit of noise (like a smudge on the lens).
  - Level 2: Very bad quality (like a photo taken through a foggy window).
- The "Skill Tree": They didn't just ask one type of question. They tested 30 different medical skills, from "What bone is this?" (Anatomy) to "What medicine should we give?" (Treatment).
- The "Confidence Meter": They didn't just check if the answer was right. They measured how sure the AI was. This is the key to catching the "Dunning-Kruger" effect.
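To make the "Dirty Gym" idea concrete, here is a minimal sketch of what a severity-graded degradation might look like. This is not the paper's actual pipeline: MedQ-Deg uses 18 degradation types, and the noise parameters below (`sigma` values) are made-up illustrations of the 0/1/2 severity idea.

```python
import numpy as np

def add_gaussian_noise(image, severity):
    """Illustrative degradation: Gaussian noise scaled by severity.

    Severity 0 returns the image unchanged; 1 and 2 add progressively
    stronger noise. The sigma values are assumed for this sketch, not
    taken from the paper.
    """
    if severity == 0:
        return image
    sigma = {1: 0.05, 2: 0.20}[severity]  # assumed noise strengths
    noisy = image + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep pixels in [0, 1]

# Tiny demo on a fake 4x4 "scan" with pixel values in [0, 1]
np.random.seed(0)
scan = np.full((4, 4), 0.5)
for level in (0, 1, 2):
    out = add_gaussian_noise(scan, level)
    print(level, round(float(np.abs(out - scan).mean()), 3))
```

A real benchmark would sweep every degradation type over every question, but the core loop looks like this: same image, same question, rising severity.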
🔍 What They Found (The Results)
After testing 40 different AI models (including big names like GPT-4, Gemini, and specialized medical AIs), they found some shocking truths:
- The "Cliff" Effect: Most AIs handle a little bit of noise okay. But once the image gets really bad (Level 2), their performance doesn't just slide down; it crashes off a cliff. They go from being helpful to being useless very quickly.
- The "Confidence Trap": Every single model tested suffered from the AI Dunning-Kruger Effect. As the images got worse, the models got more confident in their wrong answers. They didn't realize they were failing.
- Specialists vs. Generalists: You might think a medical-specialized AI would be better. Surprisingly, they performed similarly to general-purpose AIs. None of them were truly "robust" against bad images.
- The Hardest Part: Models struggled most with Anatomy (identifying body parts) when images were degraded. Oddly, their Treatment answers (choosing a drug) held up slightly better even on bad images, likely because the models were pattern-matching on the question text rather than actually "seeing" the image.
🚀 Why This Matters
This paper is a wake-up call. We cannot just trust AI to diagnose patients based on perfect lab photos. Real hospitals are messy.
If we deploy these AI doctors today, they might look at a blurry scan, confidently tell a doctor the wrong diagnosis, and the doctor might trust them because the AI sounded so sure.
MedQ-Deg gives us the tools to:
- Find the weak spots in current AI.
- Build better AI that knows when it's confused and says, "I can't see this clearly, please ask a human."
- Save lives by ensuring AI is not just smart, but also humble and reliable in the messy reality of the real world.
In short: We need AI that admits when it's blind, not AI that confidently walks off a cliff.