Decomposing Physician Disagreement in HealthBench
This paper analyzes physician disagreement in the HealthBench dataset, revealing that while the majority of variance is structural and irreducible, a small but actionable portion stems from reducible uncertainties like missing context, suggesting that improving evaluation design to close information gaps could meaningfully reduce disagreement on borderline medical AI cases.