Diagnostic Accuracy of Large Language Models for Rare Diseases: A Systematic Review and Meta-Analysis

This systematic review and meta-analysis of 15 studies reveals that while large language models augmented with external knowledge achieve higher diagnostic accuracy for rare diseases than standalone models, their performance is highly variable and dependent on benchmark disease composition, with all current evidence limited by high risk of bias and a lack of prospective clinical validation.

Nguyen, M.-H., Yang, C.-T., Cassini, T. A., Ma, F., Hamid, R., Bastarache, L., Peterson, J. F., Xu, H., Li, L., Ma, S., Shyr, C.

Published 2026-03-27

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery, but the clues are incredibly faint, the suspects are thousands of different rare conditions, and you've never seen this specific crime scene before. This is the daily reality for doctors diagnosing rare diseases.

Now, imagine you have a super-smart, super-fast assistant who has read almost every book, medical journal, and case report ever written. This is a Large Language Model (LLM). The big question researchers asked was: Can this AI assistant actually help solve these medical mysteries, or is it just making things up?

This paper is a "meta-analysis," which is like a super-review. Instead of looking at just one study, the authors gathered 15 different studies (covering nearly 40,000 cases) to see what the big picture looks like.

Here is the story of their findings, explained simply:

1. The Big Score: It's a "C-" Student

If you asked these AI assistants to pick the one correct rare disease from a list of thousands, they got it right about 43% of the time.

  • The Analogy: Imagine a multiple-choice test with 100 questions. The AI gets 43 right. That's better than guessing (which would be near 0%, because each question has thousands of possible answers), but it's nowhere near good enough for a life-or-death medical diagnosis.
  • The Problem: The results were all over the place. In some tests the AI looked like a genius; in others it was completely lost. This is called "high heterogeneity": the scores varied wildly depending on how each test was set up. (A rough sketch of how such scattered scores get pooled into one number, and how the disagreement is measured, follows this list.)
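
For the curious, the sketch below shows how a meta-analysis turns many study-level accuracies into one pooled number and how it measures the disagreement between studies (the I² statistic). It uses a standard random-effects approach (DerSimonian-Laird on logit-transformed accuracies); the study counts are invented for illustration, and this is not a reproduction of the paper's exact statistical model.

```python
# Minimal random-effects pooling sketch. The (correct, total) counts per study
# are made up for illustration; they are NOT the paper's data.
import math

studies = [(120, 400), (90, 150), (30, 200), (260, 500), (45, 60)]

y, v = [], []                                      # logit accuracy and its variance per study
for correct, total in studies:
    p = correct / total
    y.append(math.log(p / (1 - p)))                # logit transform keeps estimates well-behaved
    v.append(1 / correct + 1 / (total - correct))  # approximate variance of the logit

w = [1 / vi for vi in v]                           # inverse-variance (fixed-effect) weights
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
Q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))   # Cochran's Q
df = len(studies) - 1
I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0      # % of variation beyond chance

# DerSimonian-Laird estimate of between-study variance (tau^2)
tau2 = max(0.0, (Q - df) / (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))

# Random-effects pooled accuracy, back-transformed from the logit scale
w_re = [1 / (vi + tau2) for vi in v]
pooled_logit = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
pooled_accuracy = 1 / (1 + math.exp(-pooled_logit))

print(f"pooled accuracy ≈ {pooled_accuracy:.0%}, I² ≈ {I2:.0f}%")
```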

2. The "Training Wheels" Effect (Augmentation)

The researchers found that the AI performed much better when it wasn't just "winging it" with its internal memory.

  • Standalone AI: This is like a student taking a test with only what they memorized in school. They got the right answer about 35% of the time.
  • Augmented AI: This is the same student, but now they are allowed to use a textbook, ask a librarian, or use a step-by-step reasoning checklist during the test. These systems got the right answer about 52% of the time.
  • The Takeaway: Giving the AI access to outside medical databases and tools (like a "search engine" for diseases) helps it significantly, but it still isn't perfect. (A conceptual sketch of the standalone-versus-augmented difference follows this list.)
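
To make "standalone versus augmented" concrete, here is a conceptual sketch of one common augmentation strategy, retrieval augmentation. Everything in it is invented for illustration: ask_llm is a placeholder for whatever model API a real system would call, and the three-entry knowledge base with word-overlap retrieval stands in for the much richer medical databases, tools, and reasoning pipelines the reviewed systems actually used.

```python
# Conceptual sketch of "standalone" vs. "augmented" prompting.

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to a large language model."""
    return "<model answer>"

# Tiny made-up reference source standing in for a real medical knowledge base.
KNOWLEDGE_BASE = {
    "Fabry disease": "angiokeratomas, acroparesthesia, reduced alpha-galactosidase A",
    "Wilson disease": "Kayser-Fleischer rings, liver disease, low ceruloplasmin",
    "Pompe disease": "progressive muscle weakness, elevated CK, cardiomegaly in infants",
}

def retrieve(case: str, k: int = 2) -> list[str]:
    """Rank reference entries by naive word overlap with the case description."""
    case_words = set(case.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: -len(case_words & set(item[1].lower().split())),
    )
    return [f"{name}: {facts}" for name, facts in scored[:k]]

case = "Adult with liver disease, tremor, low ceruloplasmin and corneal rings"

# Standalone: the model answers from its internal memory alone.
standalone_prompt = f"Patient summary: {case}\nWhat is the most likely rare disease?"

# Augmented: retrieved reference entries are pasted into the prompt first.
context = "\n".join(retrieve(case))
augmented_prompt = (
    f"Reference entries:\n{context}\n\n"
    f"Patient summary: {case}\nUsing the references, what is the most likely rare disease?"
)

print(ask_llm(standalone_prompt))
print(ask_llm(augmented_prompt))
```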

3. The "Trick Question" Problem (Benchmark Bias)

This is the most surprising and important finding. The researchers realized that the difficulty of the test depended heavily on which benchmark, and which mix of diseases, the AI was being examined on.

  • The Analogy: Imagine two driving tests.
    • Test A (RareBench): The test track has wide roads, clear signs, and common obstacles. The drivers (AIs) score high here.
    • Test B (Phenopacket Store): The test track is a dark forest with no signs, hidden potholes, and extremely rare, tricky obstacles. The drivers score very low here.
  • The Reality: The "Phenopacket Store" dataset contained a huge number of ultra-rare diseases (conditions so rare they affect fewer than 1 in a million people). The AI struggled immensely with these. The "RareBench" dataset had slightly more common rare diseases, and the AI did much better.
  • The Lesson: If you only test the AI on "easy" rare diseases, it looks like a hero. If you test it on the really hard, ultra-rare ones, it looks like a novice. The paper argues we need to stop using "easy" tests if we want to know if the AI is ready for the real world. (The small arithmetic sketch after this list shows how the disease mix alone can move the headline score.)
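
The effect of the disease mix is easy to see with toy arithmetic. In the sketch below, the model's skill on each difficulty tier is held fixed and only the proportion of ultra-rare cases changes; all numbers are invented, not taken from the paper.

```python
# Toy arithmetic: the same model, two benchmarks, very different headline scores.
acc_easier_rare = 0.60   # hypothetical accuracy on "less rare" rare diseases
acc_ultra_rare = 0.20    # hypothetical accuracy on ultra-rare diseases

def benchmark_score(share_ultra_rare: float) -> float:
    """Overall accuracy as a weighted mix of the two difficulty tiers."""
    return (1 - share_ultra_rare) * acc_easier_rare + share_ultra_rare * acc_ultra_rare

print(f"Benchmark with 20% ultra-rare cases: {benchmark_score(0.2):.0%}")  # 52%
print(f"Benchmark with 80% ultra-rare cases: {benchmark_score(0.8):.0%}")  # 28%
```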

4. The "Hall of Mirrors" (Bias and Safety)

The researchers looked closely at how these studies were done and found a major red flag: Data Leakage.

  • The Analogy: Imagine a student taking a math test, but they accidentally studied the exact same questions from the answer key before the exam started. They get a 100%, but they didn't actually learn the math.
  • The Reality: Many of the AI models were trained on data that included the very cases they were being tested on. They weren't solving new mysteries; they were just remembering the answers.
  • The Verdict: Because of this, and because no study tested the AI prospectively on real patients in a real hospital, the authors' message is clear: these tools are not ready for clinical use yet. (A simplified illustration of how train/test overlap can be checked follows this list.)
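
As a rough illustration of what a leakage check looks like, the sketch below flags test cases whose text closely matches documents in a hypothetical training corpus. Real contamination auditing is much harder: the training data of commercial models is usually undisclosed, and overlap can be paraphrased rather than verbatim. All texts and the similarity threshold here are invented for illustration.

```python
# Simplified train/test contamination check using character-level similarity.
from difflib import SequenceMatcher

training_corpus = [
    "Case report: a 4-year-old with coarse facies, hepatosplenomegaly and corneal clouding...",
    "We describe an adult presenting with Kayser-Fleischer rings and low ceruloplasmin...",
]

test_cases = [
    "A 4-year-old with coarse facies, hepatosplenomegaly and corneal clouding",
    "Teenager with recurrent fevers, rash and serositis of unclear cause",
]

def max_similarity(case: str, corpus: list[str]) -> float:
    """Highest similarity between the test case and any training document."""
    return max(SequenceMatcher(None, case.lower(), doc.lower()).ratio() for doc in corpus)

for case in test_cases:
    score = max_similarity(case, training_corpus)
    flag = "possible leakage" if score > 0.6 else "looks unseen"   # illustrative threshold
    print(f"{score:.2f}  {flag}  | {case[:50]}")
```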

The Bottom Line

Large Language Models show promise for helping doctors diagnose rare diseases, especially when they are given tools to look up information as they work. However, right now:

  1. They are right less than half the time overall (about 43%), and only about half the time even with extra tools (about 52%).
  2. They perform poorly on the rarest, most difficult cases.
  3. Many studies are "cheating" by testing on data the AI has already seen.

The Final Metaphor:
Think of these AI tools as rookie detectives. They are smart, they read a lot, and with the right tools, they can solve some cases. But they haven't yet proven they can handle the most complex, real-world crimes without making mistakes. Before we let them lead the investigation, we need to give them a fair, tough test that doesn't let them peek at the answers, and we need to see them work in a real police station (hospital) before we trust them with the case file.
