This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you hire a brilliant, hyper-intelligent robot doctor to help diagnose patients. You ask it, "What is wrong with this person?" and it gives you a perfect answer. You feel relieved. But then, you ask the exact same question again, five minutes later, and it gives you a different answer. Then you ask a third time, and it gives you a third answer.
All three answers might sound reasonable, but they aren't the same. If you were a real doctor, you'd be confused. If you were a patient, you'd be worried. This paper is about building a "consistency meter" to measure exactly how much that robot doctor wobbles.
Here is the breakdown of the paper using simple analogies:
1. The Problem: The "Coin Flip" Doctor
Large Language Models (LLMs) like ChatGPT are amazing at writing and reasoning. But they work a bit like a magician pulling cards from a deck. Even if you ask the exact same question, the model doesn't "remember" an answer; it samples each next word from a probability distribution, so the same question can produce different answers on different runs.
- The Analogy: Imagine asking a friend, "What's the capital of France?" They say "Paris." You ask again immediately. They say "Paris." You ask a third time. They say "Paris." That's consistent.
- Now, imagine asking a different friend who is slightly tipsy. You ask, "What's the capital?" They say "Paris." You ask again. They say "Lyon." You ask again. They say "Marseille." They might be right the first time, but they are unreliable because they can't give you the same answer twice.
In medicine, this is dangerous. If a model diagnoses a patient with "Flu" today but "Pneumonia" tomorrow for the same symptoms, doctors can't trust it.
2. The Solution: A New "Consistency Scorecard"
The authors created a statistical framework to measure two things: Repeatability and Reproducibility.
A. Repeatability (The "Same Conditions" Test)
- The Concept: If you ask the exact same question to the exact same model with the exact same settings, does it give the same answer?
- The Analogy: This is like firing a cannonball at a target 10 times in a row with the same wind and the same gun.
- High Repeatability: All 10 shots hit the bullseye.
- Low Repeatability: The shots scatter all over the field.
- The Paper's Twist: They measure this in two ways:
- Semantic (The Meaning): Did the robot say "It's the flu" and then "It's a viral infection"? The words are different, but the meaning is the same. This is good!
- Internal (The Brain's Confidence): Did the robot know it was the flu? Or was it just guessing? The paper checks the robot's "brain waves" (probability distributions) to see if it was confident or confused, even if the final words looked similar.
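The two repeatability checks above can be sketched in a few lines of code. This is a minimal illustration, not the paper's exact statistics: it assumes we already have a list of answers from repeated runs (normalized so different wordings of the same diagnosis share one label) and, for the "internal" check, the model's probability distribution over candidate diagnoses.

```python
import math
from collections import Counter

def semantic_repeatability(answers):
    """Fraction of repeated runs that agree with the most common answer.

    Assumes answers are pre-normalized, so "flu" and "influenza"
    already map to the same label.
    """
    counts = Counter(answers)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(answers)

def confidence_entropy(probs):
    """Shannon entropy of the model's probability distribution.

    Low entropy: probability concentrated on one diagnosis (confident).
    High entropy: probability spread out (the model was guessing).
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical example: 10 repeated runs, mostly agreeing.
runs = ["flu"] * 8 + ["pneumonia", "covid"]
print(semantic_repeatability(runs))  # 0.8

# A confident distribution vs. a near-uniform (confused) one.
print(confidence_entropy([0.9, 0.05, 0.05]))    # low entropy
print(confidence_entropy([0.34, 0.33, 0.33]))   # near the maximum for 3 options
```

The key point the sketch captures is that the two measures can disagree: ten runs can all say "flu" (perfect semantic repeatability) while the underlying distribution is nearly uniform, meaning the model was barely confident each time.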
B. Reproducibility (The "Different Conditions" Test)
- The Concept: If you ask the question in a slightly different way (e.g., "What is the diagnosis?" vs. "What is the cause?"), does the model still get to the same conclusion?
- The Analogy: Imagine asking a detective, "Who stole the cookie?" and then asking, "Who ate the cookie?"
- High Reproducibility: The detective says, "It was the dog," both times.
- Low Reproducibility: The detective says, "It was the dog" the first time, but "It was the cat" the second time.
- Why it matters: In real life, doctors ask questions differently. A good AI should be robust enough to handle different phrasing and still land on the right answer.
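One simple way to quantify this (a sketch under my own assumptions, not the paper's exact statistic) is to ask each phrasing of the question repeatedly, take each phrasing's most common answer, and then check how often the phrasings land in the same place:

```python
from collections import Counter

def modal_answer(answers):
    """Most common answer across repeated runs of one prompt wording."""
    return Counter(answers).most_common(1)[0][0]

def reproducibility(runs_by_phrasing):
    """Fraction of prompt phrasings whose modal answer matches the
    overall modal answer. 1.0 means every phrasing agrees.
    """
    modes = [modal_answer(runs) for runs in runs_by_phrasing]
    overall = modal_answer(modes)
    return sum(m == overall for m in modes) / len(modes)

# Hypothetical: three wordings of the cookie question, three runs each.
phrasings = [
    ["dog", "dog", "dog"],  # "Who stole the cookie?"
    ["dog", "dog", "cat"],  # "Who ate the cookie?"
    ["dog", "cat", "cat"],  # "Who took the cookie?"
]
print(reproducibility(phrasings))  # 2 of 3 phrasings settle on "dog"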
3. The Experiment: Testing the Robots
The researchers tested this on three different AI models using two types of medical puzzles:
- USMLE Questions: These are like standardized textbook exams. The answers are clear-cut.
- Real Patient Cases: These are messy, real-life stories from the "Undiagnosed Diseases Network." The symptoms are confusing, and the data is incomplete.
They asked the AI the same questions 100 times each to see how much it wobbled.
4. The Surprising Results
Here is what they found, translated into plain English:
- The "Prompt" Matters More Than the Model: It didn't matter which AI model they used (the big expensive one or the small free one). What mattered most was how they asked the question.
- Analogy: It's like asking a student to "Just guess the answer" vs. "Show your work step-by-step using logic." The "Show your work" (specifically, Bayesian reasoning, which is like updating your guess as you get new clues) made the AI much more consistent.
- Being Right Doesn't Mean Being Consistent: This is the biggest takeaway. An AI can score highly on accuracy in any single run, yet if you ask the same question ten times, it may give ten different (but all plausible-sounding) answers.
- The Lesson: Accuracy (getting it right) and Consistency (getting the same answer) are two different things. You need both for a medical tool.
- Real Life is Easier for AI than Exams: Surprisingly, the AI was more consistent when dealing with messy, real-world patient stories than with clean, textbook exam questions. The authors think the detailed stories in real cases force the AI to focus on the specific details, narrowing down its options, whereas the exam questions leave it too much room to wander.
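The "Bayesian reasoning" mentioned in the results above boils down to multiplying a prior belief by the likelihood of each new clue and renormalizing. A toy sketch with made-up (not clinical) numbers:

```python
def bayes_update(prior, likelihood):
    """Bayes' rule: multiply prior beliefs by the likelihood of a new
    clue under each hypothesis, then renormalize to sum to 1."""
    unnormalized = {d: prior[d] * likelihood[d] for d in prior}
    total = sum(unnormalized.values())
    return {d: p / total for d, p in unnormalized.items()}

# Hypothetical prior: before any clues, flu is the more common diagnosis.
beliefs = {"flu": 0.7, "pneumonia": 0.3}

# Clue: a finding that is far more likely under pneumonia.
# (Illustrative likelihoods only.)
beliefs = bayes_update(beliefs, {"flu": 0.1, "pneumonia": 0.8})
print(beliefs)  # pneumonia now dominates
```

Each clue nudges the distribution rather than replacing the answer outright, which is one intuition for why this style of prompting might make the model's final answer less sensitive to run-to-run noise.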
5. Why This Matters
Before this paper, we mostly just checked: "Did the AI get the right answer?"
Now, we can check: "Did the AI get the right answer every time, and was it confident about it?"
This framework is like a quality control checklist for AI doctors. It helps regulators (like the FDA) and hospitals decide: "Is this AI stable enough to trust with a patient's life, or does it wobble too much?"
In short: This paper teaches us that for AI to be a true partner in healthcare, it shouldn't just be smart; it must be steady. And the way we ask it questions is the key to keeping it steady.