Imagine you have a brilliant, super-fast medical student who has read every textbook in the world. This student can look at an X-ray or an MRI and tell you exactly what's wrong. But there's a catch: this student is a compulsive liar when they aren't 100% sure.
When they are confident, they are right. But when they are guessing, they will confidently invent a diagnosis that sounds perfect but is completely made up. In the AI world, this is called a "hallucination."
This paper is about teaching a "lie detector" to catch this student before they give you a wrong answer.
The Problem: The Confident Liar
Radiologists (the doctors who read scans) are overworked. They want to use AI to help them. But current AI models (like GPT-4o) are like that medical student: they can be amazing, but they also make up facts.
The scary part? The AI doesn't say, "I'm not sure." It says, "It's definitely a broken bone," even if it's just a shadow. If a doctor trusts this wrong answer, a patient could get the wrong treatment.
The Solution: The "Group Chat" Test
The researchers came up with a clever trick called Discrete Semantic Entropy (DSE). Think of it like asking the AI the same question 15 times in a row, but with a twist: they turn up the AI's "temperature" setting, a dial that makes each response a little more random or creative, so any hidden uncertainty shows up as variation in the answers. The "semantic" part means answers that mean the same thing (say, "toast" and "a slice of toast") are grouped together and counted as one answer.
Here is the analogy:
Imagine you ask your friend, "What did I have for breakfast?"
- Scenario A (The Truth): You ask them 15 times. They say "Toast" 15 times. They are consistent.
- Scenario B (The Lie/Guess): You ask them 15 times. They say "Toast" 5 times, "Eggs" 4 times, "Cereal" 3 times, and "I don't know" 3 times. They are all over the place.
The researchers realized that when the AI is unsure, its answers will scatter like a flock of birds. When it is sure, the answers will stay in a tight cluster.
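Here is a minimal Python sketch of the repeated-sampling idea. The `ask_model` function is a hypothetical stand-in for a real vision-language model call; in the actual system it would send the scan and the question to the model at an elevated temperature, but here it just simulates an unsure model.

```python
import random
from collections import Counter

def ask_model(question: str, temperature: float) -> str:
    """Hypothetical stand-in for a real vision-language model call.
    In practice this would send the image and question to the model
    at the given sampling temperature (ignored in this toy simulation)."""
    # Simulate an unsure model: its answers scatter across options.
    return random.choice(["Toast", "Eggs", "Cereal", "I don't know"])

# Ask the same question 15 times at an elevated temperature so the
# model's uncertainty shows up as variation across the answers.
answers = [ask_model("What did I have for breakfast?", temperature=1.0)
           for _ in range(15)]
print(Counter(answers))  # e.g. Counter({'Toast': 5, 'Eggs': 4, ...})
```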
How They Measured It
They used a math concept called Entropy. Entropy is essentially a "chaos meter": it is zero when every answer is the same, and it grows as the answers spread out across different possibilities.
- Low Entropy: The AI gave the same answer (or answers that mean the same thing) every time. Result: Trust the answer.
- High Entropy: The AI's answers scattered across several different diagnoses. Result: The AI is likely hallucinating. Reject the answer.
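In code, that "chaos meter" is just Shannon entropy over the answer counts. A minimal sketch, with exact string matches standing in for the semantic grouping (the real method first merges answers that mean the same thing):

```python
import math
from collections import Counter

def discrete_entropy(answers: list[str]) -> float:
    """Shannon entropy (in bits) over the distribution of distinct answers.
    Equals 0.0 when every answer is identical; grows as answers scatter."""
    total = len(answers)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(answers).values())

print(discrete_entropy(["Toast"] * 15))                # 0.0   -> trust
print(discrete_entropy(["Toast"] * 5 + ["Eggs"] * 4 +
                       ["Cereal"] * 3 + ["I don't know"] * 3))  # ~1.97 -> reject
```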
What Happened When They Tried It?
The researchers tested this on two huge sets of medical images and questions.
- The Baseline: Without the filter, the AI was only right about 52% of the time. It was basically flipping a coin, but with a lot of confidence.
- The Filter: They told the AI: "If your answers are messy (high entropy), don't give me an answer at all."
- The Result:
- They threw away about half the questions because the AI was too confused.
- But for the questions they did answer, the accuracy jumped to 76%.
The Trade-off: It's like a security guard at a club. If the guard is strict (a low entropy cutoff), they let fewer people in, but almost everyone inside is a VIP. If the guard is lazy, everyone gets in, but there are a lot of troublemakers. The researchers found that by being strict, they got rid of the "troublemakers" (wrong answers) and kept the "VIPs" (correct answers).
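In code, the "guard" is just a threshold on the entropy. A sketch, reusing the `discrete_entropy` helper from the previous snippet; the 1.0-bit cutoff here is made up for illustration, and in practice it would be tuned to set how strict the guard is.

```python
from collections import Counter

ENTROPY_THRESHOLD = 1.0  # hypothetical cutoff, in bits; tuned in practice

def answer_or_abstain(answers: list[str]) -> str | None:
    """Return the majority answer when responses are consistent,
    or None (abstain and defer to a human) when they scatter."""
    if discrete_entropy(answers) > ENTROPY_THRESHOLD:
        return None  # high entropy: likely hallucinating, stay silent
    return Counter(answers).most_common(1)[0][0]

print(answer_or_abstain(["Pneumonia"] * 14 + ["Possible pneumonia"]))
# -> 'Pneumonia' (low entropy: consistent, so answer)
print(answer_or_abstain(["Pneumonia"] * 5 + ["Effusion"] * 5 + ["Normal"] * 5))
# -> None (high entropy: scattered, so abstain)
```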
The Catch (The "Confident Liar" Problem)
The paper admits this isn't a magic wand.
- The Problem: If the AI is consistently lying (e.g., it confidently says "It's a broken bone" 15 times in a row), the filter won't catch it. The answers are consistent, so the "chaos meter" stays low, but the answer is still wrong.
- The Reality: This method catches the AI when it is confused, but it can't catch the AI when it is confidently wrong.
Why This Matters
This is a huge step forward because it works on "Black Box" AI. You don't need to know how the AI's brain works inside; you just look at what it says. It's like checking a student's work by asking them to explain it several different ways. If their story keeps changing, you know they don't really know the answer.
In short: This paper teaches us how to make AI doctors safer. It doesn't make them perfect, but it gives us a way to say, "Hey, this AI is guessing. Let's not trust this answer and ask a human doctor instead." It turns a risky, confident liar into a cautious, helpful assistant that knows when to stay silent.