Imagine you are trying to grade a stack of essays written by a new AI student. You hire a panel of 186 expert teachers (doctors) to read the essays and decide: "Is this answer good enough to pass?" or "Is it a fail?"
You expect the teachers to mostly agree. But in this study, they disagreed on 22.5% of the essays. For more than one in every five answers, the verdict might as well be a coin flip: it depends on which teacher happens to be grading.
The authors of this paper asked a big question: "Why do the teachers disagree?" Is it because the teachers are inconsistent? Is it because the grading rules are vague? Or is it something else entirely?
Here is the breakdown of their findings, explained with simple analogies.
1. The "Who" and the "What" Don't Matter Much
The researchers tried to find the culprit by looking at two main suspects:
- The Teacher (The Physician): Do some doctors just have a stricter or more lenient personality?
  - The Verdict: No. The teachers were surprisingly consistent with one another. Personal "style" explained only about 2.4% of the disagreement. It's not that Dr. Smith is a harsh grader and Dr. Jones is a soft one; they mostly grade the same way.
- The Grading Rubric (The Rules): Are some specific questions harder to grade than others?
  - The Verdict: A little, but not much. The type of rule being applied explained about 16% of the pass/fail decisions, but only about 4% of the disagreements. Even with clear rules, the teachers still argued.
2. The Real Culprit: The "Specific Case" Mystery
If it's not the teacher and not the rule, what is it?
The study found that 81.8% of the disagreement comes from the specific combination of the question, the AI's answer, and the rule being applied.
The Analogy: Imagine you're judging apples.
- If you ask a teacher, "Is a red apple good?" they will all say "Yes."
- If you ask, "Is a bruised, half-eaten apple good?" they might all say "No."
- But if you ask, "Is this specific apple, which is red but has a tiny, weird-shaped bruise on the left side, good?" that's where the argument happens.
The disagreement isn't about the teacher or the rulebook; it's about the unique, messy details of that specific situation. In medical AI, the "bruise" is a tiny missing detail in the AI's answer or a slightly ambiguous phrase in the prompt.
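To make that breakdown concrete, here's a minimal sketch of the kind of variance decomposition involved: share out the variation in pass/fail verdicts among the grader, the rubric, and the specific case-plus-rubric combination. The column names, the toy data, and the crude between-group-means method are all my illustration, not the paper's actual analysis (which would use a proper crossed-effects model):

```python
import pandas as pd

# Toy long-format table of verdicts: one row per (case, rubric, rater) triple.
# Column names and values are hypothetical, just to show the shape of the analysis.
df = pd.DataFrame({
    "rater":  ["A", "B", "A", "B", "A", "B", "A", "B"],
    "rubric": ["r1", "r1", "r2", "r2", "r1", "r1", "r2", "r2"],
    "case":   [1, 1, 1, 1, 2, 2, 2, 2],
    "passed": [1, 1, 0, 1, 1, 0, 0, 0],  # 1 = pass, 0 = fail
})

total_var = df["passed"].var(ddof=0)

def explained_share(factors) -> float:
    """Fraction of verdict variance captured by the group means of these factors."""
    group_means = df.groupby(factors)["passed"].transform("mean")
    return group_means.var(ddof=0) / total_var

for factors in ["rater", "rubric", ["case", "rubric"]]:
    print(f"{factors}: {explained_share(factors):.0%} of variance")
```

Even in this tiny toy, the grader alone explains nothing, the rubric explains some, and the case-rubric combination explains the most, echoing the 2.4% / 4% / 81.8% pattern.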
3. The "Inverted-U" of Confusion
The researchers found a funny pattern in when the teachers argue.
- Easy Cases: If the AI gives a perfect answer, everyone agrees it's great.
- Terrible Cases: If the AI gives a nonsense answer, everyone agrees it's bad.
- The "Gray Zone": The teachers only argue when the answer is just okay. It's not clearly good, but it's not clearly bad.
The Metaphor: Think of a dimmer switch.
- At 100% brightness (Great answer), everyone sees the light.
- At 0% brightness (Bad answer), everyone sees the dark.
- At 50% brightness (The middle), some people think it's "bright enough," and others think it's "too dim." That's where the fight happens.
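Plot disagreement against answer quality and you get the inverted U of this section's title: near zero at both extremes, peaking in the middle. One simple way to see why, under an assumed model of my own (not the paper's analysis): if each teacher independently passes a borderline answer with probability p, two random teachers disagree with probability 2p(1-p), which is largest at exactly p = 0.5.

```python
# Toy model: each grader independently says "pass" with probability p.
# Two graders disagree when one passes and the other fails, in either order.
for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    disagreement = 2 * p * (1 - p)
    print(f"pass probability {p:.2f} -> expected disagreement {disagreement:.2f}")
```

Clear-cut answers (p near 0 or 1) generate almost no disagreement; the "dimmer at 50%" answers generate the most.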
4. The "Missing Puzzle Piece" vs. "Genuine Mystery"
This is the most important discovery. The researchers looked at why the answers were in that "Gray Zone." They found two types of confusion:
- Type A: Missing Information (Fixable). The AI's answer was vague because the question didn't give enough context.
- Example: "What medicine should I take?" (The AI doesn't know your age or allergies).
- Result: On cases where the context was missing, the teachers argued about twice as often.
- Type B: Genuine Medical Ambiguity (Unfixable). The question was about a medical gray area where even humans don't know the answer.
- Example: "Is this rare symptom caused by Disease X or Disease Y?" (Even experts debate this).
- Result: Surprisingly, this did not cause more arguing. The teachers actually agreed more on these hard medical mysteries than on the missing-information cases.
The Lesson: The teachers aren't arguing because medicine is confusing; they are arguing because the question never gave the AI all the facts. If we give the AI better instructions and more context, we can fix a lot of the arguing.
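If each case were labeled by its type of ambiguity, checking this pattern would be a one-line group-by. A hypothetical tally (the labels and numbers are invented to mirror the finding, not drawn from the paper's data):

```python
import pandas as pd

# One row per case: which kind of ambiguity it has, and whether the graders split.
cases = pd.DataFrame({
    "ambiguity": ["missing_context"] * 6 + ["medical_gray_area"] * 6,
    "disagreed": [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Disagreement rate per ambiguity type: in this toy data, missing context
# splits the graders about twice as often as genuine medical gray areas.
print(cases.groupby("ambiguity")["disagreed"].mean())
```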
5. The "Ceiling" on AI Performance
The study concludes that there is a "ceiling" on how well we can test medical AI.
Because the teachers themselves can't agree on 22.5% of the cases, no AI can ever receive a perfect score. Even if the AI's answer is flawless, it might get marked "wrong" simply because one teacher thought the answer was good and another thought it was bad.
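Here's a back-of-the-envelope sketch of that ceiling. Assume (my toy model, not the paper's) that a perfect answer is a guaranteed pass on clear-cut items but only a 50/50 pass in the gray zone, and that each answer is scored by one randomly chosen grader. Choosing the gray-zone share so that pairs of graders disagree on about 22.5% of items:

```python
# Hypothetical mix: 55% clear-cut items (a perfect answer always passes) and
# 45% gray-zone items (a perfect answer passes only half the time). The split
# is chosen so that two random graders disagree on ~22.5% of items overall.
items = [1.0] * 550 + [0.5] * 450  # per-item pass probability for a perfect answer

pairwise_disagreement = sum(2 * p * (1 - p) for p in items) / len(items)
ceiling = sum(items) / len(items)  # best expected score for a perfect model

print(f"grader disagreement: {pairwise_disagreement:.1%}")  # 22.5%
print(f"score ceiling:       {ceiling:.1%}")                # 77.5%
```

Under this toy model the ceiling works out to exactly one minus the disagreement rate; the real ceiling depends on how gray-zone items are distributed, but the point stands: the grading noise, not the AI, sets the maximum measurable score.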
The Final Takeaway:
- Don't blame the teachers: They are doing a good job.
- Don't blame the rules: The rules are mostly fine.
- Fix the "Missing Pieces": The biggest source of disagreement is when the AI is asked to guess without enough information. If we design better tests that give the AI all the necessary context, we can reduce the arguing.
- Accept the "Gray Zone": Some disagreement is just part of the job. In medicine, sometimes there is no single "right" answer, and that's okay.
In short: The AI isn't failing because it's stupid; it's failing because the test questions are sometimes missing the clues needed to give a clear answer.