Quantifying Hallucinations in Large Language Models on Medical Textbooks

This paper quantifies hallucinations in medical textbook-based question answering: LLaMA-70B-Instruct hallucinated in nearly 20% of its answers despite producing highly plausible text, and across models, lower hallucination rates generally correlated with higher clinician-rated usefulness.

Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

Published Thu, 12 Ma

Imagine you have a brilliant, well-read student named LLaMA. This student has read almost every book in the library and can answer questions with perfect grammar and a very confident tone. However, there's a catch: sometimes, when the student doesn't know the answer, instead of saying "I don't know," they make up a story that sounds completely true. In the world of AI, we call this hallucination.

This paper is like a report card for medical students (AI models) who are trying to answer questions based only on specific textbooks provided to them, rather than relying on their memory of things they might have read online before.

Here is the breakdown of what the researchers did, using some everyday analogies:

1. The Problem: The "Confident Liar"

In the past, we tested medical AI by asking them multiple-choice questions (like a board exam). The AI got great scores. But the researchers realized this was like testing a student who had memorized the answer key rather than actually understanding the material. The AI might be "cheating" by remembering the question from its training data, not by reasoning through the problem.

Worse, even when the AI gets the answer right, it might add a made-up detail that sounds scary or dangerous.

  • The Analogy: Imagine a tour guide who knows the history of a museum perfectly. But sometimes, when asked about a specific painting, they confidently invent a fake backstory about the artist's secret life. If you are a doctor relying on that guide, that fake story could be dangerous.

2. The Solution: The "Textbook Trap"

To catch these "confident liars," the researchers built a special testing ground called NAMEANONYMIZED (a pipeline).

  • How it works: They took public medical textbooks, cut them into small paragraphs, and asked the AI to generate questions and answers based strictly on those paragraphs.
  • The Trap: If the AI adds any information that isn't in that specific paragraph, it's caught lying. It's like a "closed-book" test where the student is only allowed to use the one page of notes in front of them.
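The "trap" described above can be sketched in a few lines of code. This is a deliberately simple illustration, not the paper's actual pipeline: it approximates "supported by the paragraph" with plain word overlap, whereas a real system would need far more careful entailment checking.

```python
# Minimal sketch of a closed-book grounding check (illustrative only;
# the paper's pipeline is not specified at this level of detail).
# Idea: every sentence in the model's answer must be supported by the
# source paragraph; here "supported" is approximated by word overlap.

def tokenize(text: str) -> set[str]:
    """Lowercase word set, with surrounding punctuation stripped."""
    return {w.strip(".,;:!?()") for w in text.lower().split()}

STOPWORDS = {"", "the", "a", "an", "is", "are", "of", "and", "in", "to"}

def unsupported_sentences(answer: str, source_paragraph: str,
                          threshold: float = 0.6) -> list[str]:
    """Return answer sentences whose content words mostly do NOT
    appear in the source paragraph (possible hallucinations)."""
    source_words = tokenize(source_paragraph)
    flagged = []
    for sentence in answer.split("."):
        words = tokenize(sentence) - STOPWORDS
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < threshold:
            flagged.append(sentence.strip())
    return flagged

paragraph = ("Aspirin irreversibly inhibits cyclooxygenase, "
             "reducing platelet aggregation.")
answer = ("Aspirin irreversibly inhibits cyclooxygenase, reducing platelet "
          "aggregation. It was discovered by a monk in 1412 and cures all "
          "infections.")
print(unsupported_sentences(answer, paragraph))  # flags the fabricated sentence
```

The first sentence of the answer restates the paragraph and passes; the invented backstory shares no content words with the source and gets caught, just like the tour guide's fake artist biography.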

3. The Experiments: Two Rounds of Testing

Experiment 1: The Baseline Check
They tested one famous open-source model (LLaMA-70B) on 5,543 questions.

  • The Result: The AI hallucinated in 19.7% of the answers. That's roughly 1 in 5 times.
  • The Scary Part: Even when the AI was hallucinating, 98.8% of its answers sounded perfectly plausible: they used the right medical words and a professional tone.
  • The Takeaway: You cannot trust an AI just because it sounds smart. It can be a smooth-talking liar 20% of the time.

Experiment 2: The Race and The Doctor's Vote
They tested 8 different AI models of various sizes and asked real doctors to grade them.

  • The Size Matters: Bigger models (with more "brain power") hallucinated less. A small model lied about 27% of the time, while a giant model lied only 9% of the time.
  • The Doctor's Verdict: Doctors preferred the models that lied less. There was a strong link: the fewer lies a model told, the more useful the doctors found it.
  • The "Tricky Question" Effect: The researchers found that asking questions in a tricky way (like "Which drug is NOT safe?" instead of "Which drug is safe?") made the AI lie much more often. It's like how people trip over negative words in a sentence; the AI gets confused and makes things up.

4. The Cost of Truth: The Human Price Tag

The paper highlights a major bottleneck: Verification.

  • The Analogy: You can use a robot to write a million medical answers in seconds for pennies. But to check if those answers are true, you need a human doctor to read every single one.
  • The Reality: The cost of the human doctor checking the work is 10 times higher than the cost of the robot doing the writing.
  • The Conclusion: Until we can build a robot that can check the work of another robot as well as a human doctor can, we cannot fully trust AI in medicine. We need a human "editor" in the loop, and that is expensive and slow.
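The 10-to-1 cost gap above can be made concrete with a back-of-the-envelope model. The unit costs below are hypothetical placeholders; only the roughly 10x ratio reflects the paper's finding.

```python
# Back-of-the-envelope cost model: cheap machine generation vs. expensive
# human verification. NOTE: the per-answer costs are hypothetical; only
# the ~10x verification-to-generation ratio comes from the paper.

def pipeline_cost(n_answers: int,
                  gen_cost_per_answer: float = 0.001,    # hypothetical API cost ($)
                  review_cost_per_answer: float = 0.010  # hypothetical clinician cost ($)
                  ) -> dict[str, float]:
    generation = n_answers * gen_cost_per_answer
    verification = n_answers * review_cost_per_answer
    return {
        "generation": generation,
        "verification": verification,
        "ratio": verification / generation,
    }

costs = pipeline_cost(5_543)  # the number of QA pairs in Experiment 1
print(costs)  # verification dominates, at ~10x the generation cost
```

Because both costs scale linearly with the number of answers, generating more text never closes the gap: every extra machine-written answer adds ten times as much human review work.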

Summary: What Does This Mean for You?

  • AI is not ready for prime time in medicine: Even the biggest models still make up facts in roughly 1 in 10 answers, and smaller models closer to 1 in 4.
  • Looks can be deceiving: Just because an AI answer sounds professional and uses big words doesn't mean it's true.
  • Bigger is better, but not perfect: Bigger models make fewer mistakes, but they still make them.
  • Human oversight is non-negotiable: We cannot just let AI run hospitals. We need humans to double-check everything, and that is the biggest hurdle to making this technology safe and affordable.

In short, the paper is a warning: Don't let the AI drive the car yet. It has a great voice and knows the map, but it still occasionally invents new roads that don't exist.