Aggregate benchmark scores obscure patient safety implications of errors across frontier language models

This study demonstrates that aggregate benchmark scores fail to capture critical patient safety risks in frontier language models for healthcare, as significant and unpredictable variations in error directionality, contextual bias, and crisis response across models reveal that overall accuracy alone cannot predict clinical safety.

Linzmayer, R., Ramaswamy, A., Hugo, H., Nadkarni, G., Elhadad, N.

Published 2026-03-20
📖 4 min read☕ Coffee break read
⚕️

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content. Read full disclaimer

Imagine you are buying a car. The salesperson shows you a brochure with a big, shiny star rating: "95% Safe!" You feel confident and buy it. But later, you discover that while the car rarely crashes, when it does fail, it has a weird habit of driving straight off a cliff instead of just hitting a tree.

That is exactly what this paper is warning us about regarding AI chatbots used for health advice.

Here is the breakdown of the research in simple terms:

1. The "Average Score" Trap

Right now, when companies release new AI models (like GPT-5 or Claude), they give us a single "benchmark score" to say how good they are. It's like saying a doctor is "85% accurate."

The researchers say this is dangerous. In medicine, direction matters more than the average.

  • Under-triage (The Cliff): The AI says, "You're fine, go home," when you actually need the ER immediately. This is deadly.
  • Over-triage (The Tree): The AI says, "Go to the ER immediately," when you just need a band-aid. This is annoying and wastes money, but it's not usually fatal.

The paper found that two AI models could both have an "85% accuracy" score, but one might be a "safe" model that mostly sends people to the ER unnecessarily, while the other is a "dangerous" model that misses life-threatening emergencies. The average score hides the difference between a nuisance and a tragedy.

2. The "Family Member" Effect

The researchers tested the AI with a specific trick: they asked the AI to imagine a scenario where a friend or family member said, "Oh, don't worry, it's probably nothing."

The Result: Almost every AI model got "swept up" by this reassurance. Even when the symptoms were scary, if a "friend" minimized them, the AI was much more likely to say, "Yeah, you're probably fine," and send the patient home. It's like a doctor who stops listening to your pain because your friend says, "He's just being dramatic."

3. The "Suicide" Silence

The team also tested how the AI handled questions about suicide.

  • The Problem: When people asked about suicide, the AI often forgot to give the emergency hotline number (like 988 in the US).
  • The Surprise: It didn't matter if the person sounded very specific about their plan or just vague. The AI was inconsistent. Sometimes it gave the number; sometimes it just gave generic advice. It's like a lifeguard who sometimes throws a rope to a drowning swimmer and sometimes just yells, "Try to swim harder!"

4. The "Newer Isn't Always Better" Rule

You might think, "Well, the newest AI model must be the safest."
The researchers tested the very latest models (like GPT-5.4) against older ones.

  • The Shock: The newest model actually made more mistakes in missing emergencies than a slightly older version.
  • The Lesson: Just because a model is "newer" or has a higher "overall score" doesn't mean it's safer. In fact, it might be worse at spotting the specific things that kill people.

The Big Takeaway

The paper argues that we need to stop looking at the single "Star Rating" for AI health tools.

Instead, we need to look at the detailed report card:

  • How often does it miss a heart attack?
  • How often does it panic over a headache?
  • Does it get confused if a friend says "it's nothing"?

The Analogy:
Imagine you are hiring a security guard for a bank.

  • Current Method: You ask, "How often do you catch thieves?" The guard says, "99% of the time!" You hire him.
  • The Reality: You find out that the 1% he misses are the ones stealing the entire vault, and he lets them walk out the front door smiling. Meanwhile, he stops 100 innocent people a day just to check their IDs (Over-triage).

The "99% score" looks great, but the direction of his errors makes him a terrible guard.

Conclusion: We cannot trust AI with our health just because it has a high score. We need to know exactly how it fails before we let it decide if we need to go to the hospital.

Get papers like this in your inbox

Personalized daily or weekly digests matching your interests. Gists or technical summaries, in your language.

Try Digest →