An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification

This paper empirically demonstrates that uncertainty-based selective prediction often fails in multimodal clinical condition classification because of severe class-dependent miscalibration: models incorrectly defer accurate predictions while retaining uncertain ones. The findings highlight the limitations of standard aggregate metrics and the critical need for calibration-aware evaluation in safety-critical AI.

L. Julián Lechuga López, Farah E. Shamout, Tim G. J. Rudner

Published 2026-03-04

Imagine you are a doctor in a busy emergency room. You have a new, super-smart AI assistant that can look at a patient's medical history (like a long list of notes) and their chest X-rays to predict whether they have any of 25 different serious conditions, like heart failure, pneumonia, or kidney issues.

This AI is very good at its job. When you ask it, "Does this patient have a problem?" it usually says "Yes" or "No" with a high degree of accuracy. In fact, if you just look at its overall score, it seems like a miracle worker.

But here is the catch: The AI is terrible at knowing when it is unsure.

The "Overconfident Guessing" Problem

Think of this AI like a student taking a multiple-choice test who is convinced they know the answers, even when they are completely wrong.

  • The Good News: When the student is right, they are confident.
  • The Bad News: When the student is wrong, they are also confident.
  • The Worst News: When the student is right about a rare, tricky question, they might suddenly become unsure and say, "I don't know!"

In the world of AI, this is called miscalibration. The AI's "confidence score" doesn't match reality. It thinks it's sure when it's actually guessing, and it thinks it's guessing when it's actually sure.
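
If you want to see what this mismatch looks like as a number, the standard metric is expected calibration error (ECE): bucket predictions by confidence and compare each bucket's average confidence to its actual accuracy. Here is a minimal sketch; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Compare what the model claims (confidence) with what happened (accuracy).

    confidences: predicted probability of the chosen answer, shape (N,)
    correct:     1 if the prediction was right, 0 otherwise, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight each bin by its share of cases
    return ece  # 0.0 means perfectly calibrated; bigger means more guessing
```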

The "Selective Prediction" Safety Net

To fix this, the researchers tried to give the AI a "safety net" called Selective Prediction.

Imagine the AI has a rule: "If I'm not at least 90% sure, I will stop and say, 'Doctor, please look at this one yourself.'"

The idea is that the AI should only make predictions when it's confident, and hand over the tricky, uncertain cases to a human expert. This should make the system safer. If the AI is unsure, it stays quiet, and the human steps in.
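
In code, this safety net is nothing more than a confidence threshold. Here is a minimal sketch for one condition treated as a yes/no question; the 0.9 threshold mirrors the "90% sure" rule above, and all names are illustrative rather than the paper's implementation:

```python
import numpy as np

def selective_predict(probs, threshold=0.9):
    """Answer only when confident; otherwise defer to the doctor.

    probs: predicted probability that the condition is present, shape (N,)
    """
    probs = np.asarray(probs, dtype=float)
    confidence = np.maximum(probs, 1.0 - probs)  # confidence in the chosen answer
    defer = confidence < threshold               # "Doctor, please look at this one"
    preds = (probs >= 0.5).astype(int)           # the AI's yes/no call
    return preds, defer
```

The whole scheme stands or falls on `confidence` meaning what it says, and that is exactly what miscalibration breaks.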

What the Researchers Found

The researchers tested this safety net on the AI using real hospital data. They expected the AI to get better at its job by handing over the hard cases. Instead, they found a disaster.

The Analogy of the Broken Smoke Detector:
Imagine a smoke detector that is supposed to scream when there is a fire.

  • Ideally: It screams when there is smoke (Fire = Alarm). It stays quiet when there is no smoke (No Fire = Silence).
  • The Reality in this study: The detector is broken.
    • Sometimes, there is a huge fire (a real patient with a rare disease), but the detector stays silent because it's "confident" there's no fire. Result: The patient gets missed.
    • Other times, there is just a piece of toast burning (a common, easy condition), and the detector screams "FIRE!" because it's confused. Result: The doctor gets called for no reason, wasting time.

Because the AI was so confused about which conditions it was bad at (especially the rare ones), the "safety net" didn't work.

  • The AI would refuse to diagnose the rare, dangerous diseases (thinking it was unsure), leaving the doctor to catch them.
  • But it would also confidently misdiagnose common diseases, or refuse to answer on common cases it would have gotten right.

The Result: When the AI started "refusing" to answer, the overall performance of the system actually got worse, not better. The safety net was full of holes.
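
The standard way to measure whether the net helps is the risk-coverage trade-off: as the model hands over more of its least-confident cases, the error rate on the cases it keeps should fall. A minimal sketch of that check, assuming we have a confidence score and a right/wrong flag for every prediction (names are illustrative):

```python
import numpy as np

def risk_at_coverage(confidences, errors, coverage):
    """Error rate on the `coverage` fraction of most-confident predictions.

    errors: 1 if the prediction was wrong, 0 if it was right, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(-confidences)  # most confident first
    n_keep = max(1, int(round(coverage * len(errors))))
    return errors[order[:n_keep]].mean()
```

With a working safety net, risk drops as coverage shrinks. A broken net shows up as risk staying flat, or even rising, as the model abstains more, which is the pattern this study reports.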

Why Did This Happen?

The problem was Class-Dependent Miscalibration.

Think of the 25 diseases as different types of fruit in a basket.

  • Common fruits (Apples): There are thousands of them. The AI has seen them a million times. It knows them well.
  • Rare fruits (Jackfruit): There are only a few. The AI has barely seen them.

The AI was great at identifying Apples. But when it saw a Jackfruit, it got confused.

  • Sometimes it thought the Jackfruit was an Apple (and was very confident it was right).
  • Sometimes it thought the Jackfruit was a mystery object (and was very unsure).

Because the AI couldn't tell the difference between "I know this rare fruit" and "I'm guessing," the safety mechanism (Selective Prediction) failed. It couldn't tell the doctor which rare cases needed human help.
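
Because the failure is class-dependent, the calibration check has to be run per condition rather than in aggregate; a single overall number can look healthy while the rare classes are badly off. A minimal sketch, reusing the `expected_calibration_error` function from the earlier sketch (the multi-label layout and names are assumptions):

```python
import numpy as np

def per_class_ece(probs, labels, n_bins=10):
    """One calibration score per condition, so the "Jackfruits" can't hide.

    probs, labels: arrays of shape (N, 25) for 25 conditions; labels are 0/1.
    """
    scores = {}
    for c in range(probs.shape[1]):
        conf = np.maximum(probs[:, c], 1.0 - probs[:, c])
        correct = ((probs[:, c] >= 0.5) == labels[:, c].astype(bool)).astype(int)
        scores[c] = expected_calibration_error(conf, correct, n_bins)
    return scores  # compare the rare conditions against the common ones
```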

Did They Fix It?

The researchers tried a simple trick: Loss Upweighting.
This is like telling the AI: "Hey, you keep messing up the Jackfruits. Every time you get a Jackfruit wrong, I'm going to give you a double penalty. Try harder!"
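
In code, loss upweighting is a one-line change to the training loss. A minimal PyTorch sketch for a 25-condition multi-label setup; the weight values and rare-condition indices below are hypothetical placeholders, and the paper's exact scheme may differ:

```python
import torch
import torch.nn as nn

# "Double penalty for Jackfruits": rare conditions get a larger weight,
# so mistakes on them cost more during training.
num_conditions = 25
pos_weight = torch.ones(num_conditions)
pos_weight[[7, 19]] = 2.0  # pretend conditions 7 and 19 are the rare ones

# pos_weight scales the loss on positive examples of each condition
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, num_conditions)                     # a batch of 8 patients
targets = torch.randint(0, 2, (8, num_conditions)).float()  # ground-truth 0/1 labels
loss = loss_fn(logits, targets)  # rare-condition mistakes now count double
```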

  • Did it help? A little bit. The AI got slightly better at knowing when it was unsure about the rare fruits.
  • Did it fix the problem? No. The safety net was still broken. The AI still couldn't reliably decide when to ask for help.

The Big Takeaway

This paper is a warning sign for the future of AI in healthcare.

  1. High Scores Lie: Just because an AI has a high "accuracy score" (like getting 90% of a test right) doesn't mean it's safe to use in a hospital.
  2. Confidence is Key: For AI to be safe, it needs to know when it doesn't know. Currently, our best AI models are bad at this, especially for rare diseases.
  3. The "Fail-Safe" is Broken: We can't just rely on the AI to say, "I'm not sure, you check it." Right now, the AI is too confused to know when to say that.

In short: We are building very smart AI doctors, but they are like overconfident interns who think they know everything. Until we teach them to be humble and know when to ask for help, we can't fully trust them to keep patients safe.
