Aggregate benchmark scores obscure patient safety… — Plain-Language Explanation

Original authors: Linzmayer, R., Ramaswamy, A., Hugo, H., Nadkarni, G., Elhadad, N.

Published 2026-03-20

📖 4 min read☕ Coffee break read

Original authors: Linzmayer, R., Ramaswamy, A., Hugo, H., Nadkarni, G., Elhadad, N.

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content. Read full disclaimer

Imagine you are buying a car. The salesperson shows you a brochure with a big, shiny star rating: "95% Safe!" You feel confident and buy it. But later, you discover that while the car rarely crashes, when it does fail, it has a weird habit of driving straight off a cliff instead of just hitting a tree.

That is exactly what this paper is warning us about regarding AI chatbots used for health advice.

Here is the breakdown of the research in simple terms:

1. The "Average Score" Trap

Right now, when companies release new AI models (like GPT-5 or Claude), they give us a single "benchmark score" to say how good they are. It's like saying a doctor is "85% accurate."

The researchers say this is dangerous. In medicine, direction matters more than the average.

Under-triage (The Cliff): The AI says, "You're fine, go home," when you actually need the ER immediately. This is deadly.
Over-triage (The Tree): The AI says, "Go to the ER immediately," when you just need a band-aid. This is annoying and wastes money, but it's not usually fatal.

The paper found that two AI models could both have an "85% accuracy" score, but one might be a "safe" model that mostly sends people to the ER unnecessarily, while the other is a "dangerous" model that misses life-threatening emergencies. The average score hides the difference between a nuisance and a tragedy.

2. The "Family Member" Effect

The researchers tested the AI with a specific trick: they asked the AI to imagine a scenario where a friend or family member said, "Oh, don't worry, it's probably nothing."

The Result: Almost every AI model got "swept up" by this reassurance. Even when the symptoms were scary, if a "friend" minimized them, the AI was much more likely to say, "Yeah, you're probably fine," and send the patient home. It's like a doctor who stops listening to your pain because your friend says, "He's just being dramatic."

3. The "Suicide" Silence

The team also tested how the AI handled questions about suicide.

The Problem: When people asked about suicide, the AI often forgot to give the emergency hotline number (like 988 in the US).
The Surprise: It didn't matter if the person sounded very specific about their plan or just vague. The AI was inconsistent. Sometimes it gave the number; sometimes it just gave generic advice. It's like a lifeguard who sometimes throws a rope to a drowning swimmer and sometimes just yells, "Try to swim harder!"

4. The "Newer Isn't Always Better" Rule

You might think, "Well, the newest AI model must be the safest."
The researchers tested the very latest models (like GPT-5.4) against older ones.

The Shock: The newest model actually made more mistakes in missing emergencies than a slightly older version.
The Lesson: Just because a model is "newer" or has a higher "overall score" doesn't mean it's safer. In fact, it might be worse at spotting the specific things that kill people.

The Big Takeaway

The paper argues that we need to stop looking at the single "Star Rating" for AI health tools.

Instead, we need to look at the detailed report card:

How often does it miss a heart attack?
How often does it panic over a headache?
Does it get confused if a friend says "it's nothing"?

The Analogy:
Imagine you are hiring a security guard for a bank.

Current Method: You ask, "How often do you catch thieves?" The guard says, "99% of the time!" You hire him.
The Reality: You find out that the 1% he misses are the ones stealing the entire vault, and he lets them walk out the front door smiling. Meanwhile, he stops 100 innocent people a day just to check their IDs (Over-triage).

The "99% score" looks great, but the direction of his errors makes him a terrible guard.

Conclusion: We cannot trust AI with our health just because it has a high score. We need to know exactly how it fails before we let it decide if we need to go to the hospital.

1. Problem Statement

Frontier Large Language Models (LLMs) are increasingly used as de facto first points of contact for health-related queries, with millions of users relying on them for symptom checking and triage. However, current safety evaluations rely heavily on aggregate benchmark scores (e.g., overall accuracy). The authors argue that these metrics are insufficient for clinical safety because they mask the directionality of errors.

Clinical Asymmetry: In triage, missing a true emergency (under-triage) has fundamentally different and potentially fatal consequences compared to recommending unnecessary care (over-triage).
The Gap: A model can achieve high aggregate accuracy while exhibiting dangerous, systematic biases (e.g., consistently under-triaging specific demographics or crisis scenarios) that aggregate scores fail to reveal.

2. Methodology

The study applied a rigorous, factorial experimental design to evaluate nine widely deployed general-purpose AI models (including GPT-5 variants, Claude, Gemini, DeepSeek, and Llama) alongside the previously published GPT-Health results.

Dataset: The study utilized the Nature Medicine triage benchmark (Ramaswamy et al.), consisting of 960 structured clinical vignettes.
- Acuity Levels: Four levels: Home (A), Routine (B), Urgent (C), and ED Now (D).
- Edge Cases: 480 vignettes with dual gold standards (spanning adjacent acuity levels) were analyzed separately.
- Non-Edge Cases: 480 vignettes with a single gold standard were used for directional error analysis.
Experimental Variables: Each vignette was systematically varied across four binary factors:
1. Anchoring: Presence of a companion minimizing symptoms (false reassurance) or exaggerating them.
2. Access Barriers: Lack of insurance or after-hours presentation.
3. Demographics: Patient race (Black vs. White) and sex (Woman vs. Man).
Evaluation Protocol:
- Models were queried via API with identical prompts.
- Sampling: 10 independent samples per vignette (except for reasoning models queried without sampling parameters); the modal response was used as the final answer.
- Metrics:
  - Directional Error Rates: Under-triage (recommendation < gold standard) vs. Over-triage (recommendation > gold standard).
  - Contextual Bias: Mixed-effects logistic regression to test associations between predictors (e.g., anchoring) and triage shifts.
  - Crisis Safety: Analysis of crisis resource mention rates (e.g., 988 hotline) in suicidality vignettes.

3. Key Contributions

Decoupling Accuracy from Safety: Demonstrated that aggregate accuracy is uncorrelated with under-triage rates, proving that high accuracy scores do not guarantee clinical safety.
Directional Heterogeneity: Revealed that models exhibit vastly different error profiles; some are prone to dangerous under-triage, while others are prone to resource-wasting over-triage.
Contextual Vulnerability: Identified that symptom minimization by companions is a consistent, significant predictor of under-triage across all models, a bias that persists regardless of model family or provider.
Evaluation Framework Critique: Argued that current benchmarks (e.g., HealthBench, MedCalc-Bench) are inadequate for clinical deployment because they lack stratification by error direction and clinical severity.

4. Key Results

A. Aggregate Accuracy vs. Safety

In-range accuracy (recommendations within the gold-standard window) ranged from 75.0% (Llama-3.3-70B) to 87.7% (GPT-5-mini).
Under-triage vs. Over-triage:
- Under-triage ranged from 0.0% (GPT-5.2) to 12.3% (GPT-5-mini).
- Over-triage ranged from 9.4% (GPT-5-mini) to 36.9% (Gemini-2.5-Pro).
- Crucial Finding: Under-triage was uncorrelated with aggregate accuracy ( $\rho = -0.05, p=0.881$ ). A model with high accuracy could still have high under-triage rates.

B. Critical Failure in Emergency Scenarios (ED Now)

In the highest acuity category ("ED Now," requiring immediate emergency evaluation), under-triage rates varied wildly:
- GPT-5.2 & Gemini-2.5-Pro: 0% under-triage (0/64).
- GPT-5-mini: 75% under-triage (48/64).
- GPT-5.4-Thinking: 8% under-triage (5/64).
Regression to the Mean: GPT-5.4 showed a statistically significant increase in under-triage compared to GPT-5.2, indicating that newer model releases do not guarantee monotonic safety improvements.

C. Contextual and Demographic Bias

Anchoring Effect: When a companion minimized symptoms, models were 3–15 times more likely to downgrade an edge case (OR range 2.9–14.9). This was the only predictor significant across all models.
Access Barriers: Six of ten models were more likely to downgrade edge cases when access barriers were present.
Demographics: Patient race and sex were not significant predictors of triage errors in any model tested.

D. Crisis Resource Provision

In suicidality vignettes, crisis resource mention rates were low and highly variable across models (median 31.2% with findings, 25.0% without).
There was no consistent pattern regarding whether objective findings (specific plan) increased the likelihood of a crisis referral mention.

5. Significance and Implications

Safety Cannot Be Inferred from Benchmarks: The study concludes that "health branding" or high aggregate scores are not reliable proxies for harm-relevant behavior. A model optimized for general accuracy may be dangerously biased in specific clinical contexts.
Need for Directional Reporting: The authors advocate for a new standard in AI evaluation where directional error rates (under-triage vs. over-triage) stratified by clinical severity and context are reported.
Deployment Risks: The observed heterogeneity suggests that deploying models without understanding their specific error profiles could lead to systematic patient harm, particularly for vulnerable populations facing access barriers or those whose symptoms are downplayed by others.
Future Work: Evaluation frameworks must adopt factorial designs that systematically vary clinical context to detect patterned shifts toward harmful directions, rather than relying on static aggregate metrics.

Conclusion: The paper serves as a critical warning that the current "accuracy-first" paradigm in LLM evaluation is dangerously insufficient for high-stakes medical applications. Safety requires a granular understanding of how and in what direction models fail.

Aggregate benchmark scores obscure patient safety implications of errors across frontier language models