Quantifying Hallucinations in Large Language Models on Medical Textbooks

This paper quantifies hallucinations in medical textbook-based question answering: LLaMA-70B-Instruct hallucinated in nearly 20% of its answers despite producing highly plausible text, and across models, lower hallucination rates generally correlated with higher clinician-rated usefulness.

Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

Published Thu, 12 Ma

Imagine you have a brilliant, well-read student named LLaMA. This student has read almost every book in the library and can answer questions with perfect grammar and a very confident tone. However, there's a catch: sometimes, when the student doesn't know the answer, instead of saying "I don't know," they make up a story that sounds completely true. In the world of AI, we call this hallucination.

This paper is like a report card for medical students (AI models) who are trying to answer questions based only on specific textbooks provided to them, rather than relying on their memory of things they might have read online before.

Here is the breakdown of what the researchers did, using some everyday analogies:

1. The Problem: The "Confident Liar"

In the past, we tested medical AI by asking them multiple-choice questions (like a board exam). The AI got great scores. But the researchers realized this was like testing a student who had memorized the answer key rather than actually understanding the material. The AI might be "cheating" by remembering the question from its training data, not by reasoning through the problem.

Worse, even when the AI gets the answer right, it might add a made-up detail that sounds scary or dangerous.

  • The Analogy: Imagine a tour guide who knows the history of a museum perfectly. But sometimes, when asked about a specific painting, they confidently invent a fake backstory about the artist's secret life. If you are a doctor relying on that guide, that fake story could be dangerous.

2. The Solution: The "Textbook Trap"

To catch these "confident liars," the researchers built a special testing ground called NAMEANONYMIZED (a pipeline).

  • How it works: They took public medical textbooks, cut them into small paragraphs, and asked the AI to generate questions and answers based strictly on those paragraphs.
  • The Trap: If the AI adds any information that isn't in that specific paragraph, it's caught lying. It's like a "closed-book" test where the student is only allowed to use the one page of notes in front of them.
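The "trap" described above can be sketched in a few lines of code. This is a deliberately simple illustration, not the paper's actual pipeline: it approximates "supported by the paragraph" with plain word overlap, whereas a real system would need far more careful entailment checking.

```python
# Minimal sketch of a closed-book grounding check (illustrative only;
# the paper's pipeline is not specified at this level of detail).
# Idea: every sentence in the model's answer must be supported by the
# source paragraph; here "supported" is approximated by word overlap.

def tokenize(text: str) -> set[str]:
    """Lowercase word set, with surrounding punctuation stripped."""
    return {w.strip(".,;:!?()") for w in text.lower().split()}

STOPWORDS = {"", "the", "a", "an", "is", "are", "of", "and", "in", "to"}

def unsupported_sentences(answer: str, source_paragraph: str,
                          threshold: float = 0.6) -> list[str]:
    """Return answer sentences whose content words mostly do NOT
    appear in the source paragraph (possible hallucinations)."""
    source_words = tokenize(source_paragraph)
    flagged = []
    for sentence in answer.split("."):
        words = tokenize(sentence) - STOPWORDS
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < threshold:
            flagged.append(sentence.strip())
    return flagged

paragraph = ("Aspirin irreversibly inhibits cyclooxygenase, "
             "reducing platelet aggregation.")
answer = ("Aspirin irreversibly inhibits cyclooxygenase, reducing platelet "
          "aggregation. It was discovered by a monk in 1412 and cures all "
          "infections.")
print(unsupported_sentences(answer, paragraph))  # flags the fabricated sentence
```

The first sentence of the answer restates the paragraph and passes; the invented backstory shares no content words with the source and gets caught, just like the tour guide's fake artist biography.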

3. The Experiments: Two Rounds of Testing

Experiment 1: The Baseline Check
They tested one famous open-source model (LLaMA-70B) on 5,543 questions.

  • The Result: The AI hallucinated in 19.7% of the answers. That's roughly 1 in 5 times.
  • The Scary Part: Even when the AI was hallucinating, 98.8% of its answers sounded perfectly plausible: they used the right medical words and a professional tone.
  • The Takeaway: You cannot trust an AI just because it sounds smart. It can be a smooth-talking liar 20% of the time.

Experiment 2: The Race and The Doctor's Vote
They tested 8 different AI models of various sizes and asked real doctors to grade them.

  • The Size Matters: Bigger models (with more "brain power") hallucinated less. A small model lied about 27% of the time, while a giant model lied only 9% of the time.
  • The Doctor's Verdict: Doctors preferred the models that lied less. There was a strong link: the fewer lies a model told, the more useful the doctors found it.
  • The "Tricky Question" Effect: The researchers found that asking questions in a tricky way (like "Which drug is NOT safe?" instead of "Which drug is safe?") made the AI lie much more often. It's like how people trip over negative words in a sentence; the AI gets confused and makes things up.

4. The Cost of Truth: The Human Price Tag

The paper highlights a major bottleneck: Verification.

  • The Analogy: You can use a robot to write a million medical answers in seconds for pennies. But to check if those answers are true, you need a human doctor to read every single one.
  • The Reality: The cost of the human doctor checking the work is 10 times higher than the cost of the robot doing the writing.
  • The Conclusion: Until we can build a robot that can check the work of another robot as well as a human doctor can, we cannot fully trust AI in medicine. We need a human "editor" in the loop, and that is expensive and slow.
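The 10-to-1 cost gap above can be made concrete with a back-of-the-envelope model. The unit costs below are hypothetical placeholders; only the roughly 10x ratio reflects the paper's finding.

```python
# Back-of-the-envelope cost model: cheap machine generation vs. expensive
# human verification. NOTE: the per-answer costs are hypothetical; only
# the ~10x verification-to-generation ratio comes from the paper.

def pipeline_cost(n_answers: int,
                  gen_cost_per_answer: float = 0.001,    # hypothetical API cost ($)
                  review_cost_per_answer: float = 0.010  # hypothetical clinician cost ($)
                  ) -> dict[str, float]:
    generation = n_answers * gen_cost_per_answer
    verification = n_answers * review_cost_per_answer
    return {
        "generation": generation,
        "verification": verification,
        "ratio": verification / generation,
    }

costs = pipeline_cost(5_543)  # the number of QA pairs in Experiment 1
print(costs)  # verification dominates, at ~10x the generation cost
```

Because both costs scale linearly with the number of answers, generating more text never closes the gap: every extra machine-written answer adds ten times as much human review work.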

Summary: What Does This Mean for You?

  • AI is not ready for prime time in medicine: Even the biggest models still make up facts in roughly 1 in 10 answers, and smaller models closer to 1 in 4.
  • Looks can be deceiving: Just because an AI answer sounds professional and uses big words doesn't mean it's true.
  • Bigger is better, but not perfect: Bigger models make fewer mistakes, but they still make them.
  • Human oversight is non-negotiable: We cannot just let AI run hospitals. We need humans to double-check everything, and that is the biggest hurdle to making this technology safe and affordable.

In short, the paper is a warning: Don't let the AI drive the car yet. It has a great voice and knows the map, but it still occasionally invents new roads that don't exist.