An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification

This paper empirically demonstrates that uncertainty-based selective prediction often fails in multimodal clinical condition classification because of severe class-dependent miscalibration: models incorrectly defer accurate predictions while retaining uncertain ones. The findings highlight the limitations of standard aggregate metrics and the critical need for calibration-aware evaluation in safety-critical AI.

L. Julián Lechuga López, Farah E. Shamout, Tim G. J. Rudner

Published 2026-03-04

Imagine you are a doctor in a busy emergency room. You have a new, super-smart AI assistant that can look at a patient's medical history (like a long list of notes) and their chest X-rays to predict whether they have any of 25 different serious conditions, like heart failure, pneumonia, or kidney issues.

This AI is very good at its job. When you ask it, "Does this patient have a problem?" it usually says "Yes" or "No" with a high degree of accuracy. In fact, if you just look at its overall score, it seems like a miracle worker.

But here is the catch: The AI is terrible at knowing when it is unsure.

The "Overconfident Guessing" Problem

Think of this AI like a student taking a multiple-choice test who is convinced they know the answers, even when they are completely wrong.

  • The Good News: When the student is right, they are confident.
  • The Bad News: When the student is wrong, they are also confident.
  • The Worst News: When the student is right about a rare, tricky question, they might suddenly become unsure and say, "I don't know!"

In the world of AI, this is called miscalibration. The AI's "confidence score" doesn't match reality. It thinks it's sure when it's actually guessing, and it thinks it's guessing when it's actually sure.
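
If you want to see what this mismatch looks like as a number, the standard metric is expected calibration error (ECE): bucket predictions by confidence and compare each bucket's average confidence to its actual accuracy. Here is a minimal sketch; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Compare what the model claims (confidence) with what happened (accuracy).

    confidences: predicted probability of the chosen answer, shape (N,)
    correct:     1 if the prediction was right, 0 otherwise, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight each bin by its share of cases
    return ece  # 0.0 means perfectly calibrated; bigger means more guessing
```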

The "Selective Prediction" Safety Net

To fix this, the researchers tried to give the AI a "safety net" called Selective Prediction.

Imagine the AI has a rule: "If I'm not at least 90% sure, I will stop and say, 'Doctor, please look at this one yourself.'"

The idea is that the AI should only make predictions when it's confident, and hand over the tricky, uncertain cases to a human expert. This should make the system safer. If the AI is unsure, it stays quiet, and the human steps in.
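
In code, this safety net is nothing more than a confidence threshold. Here is a minimal sketch for one condition treated as a yes/no question; the 0.9 threshold mirrors the "90% sure" rule above, and all names are illustrative rather than the paper's implementation:

```python
import numpy as np

def selective_predict(probs, threshold=0.9):
    """Answer only when confident; otherwise defer to the doctor.

    probs: predicted probability that the condition is present, shape (N,)
    """
    probs = np.asarray(probs, dtype=float)
    confidence = np.maximum(probs, 1.0 - probs)  # confidence in the chosen answer
    defer = confidence < threshold               # "Doctor, please look at this one"
    preds = (probs >= 0.5).astype(int)           # the AI's yes/no call
    return preds, defer
```

The whole scheme stands or falls on `confidence` meaning what it says, and that is exactly what miscalibration breaks.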

What the Researchers Found

The researchers tested this safety net on the AI using real hospital data. They expected the AI to get better at its job by handing over the hard cases. Instead, they found a disaster.

The Analogy of the Broken Smoke Detector:
Imagine a smoke detector that is supposed to scream when there is a fire.

  • Ideally: It screams when there is smoke (Fire = Alarm). It stays quiet when there is no smoke (No Fire = Silence).
  • The Reality in this study: The detector is broken.
    • Sometimes, there is a huge fire (a real patient with a rare disease), but the detector stays silent because it's "confident" there's no fire. Result: The patient gets missed.
    • Other times, there is just a piece of toast burning (a common, easy condition), and the detector screams "FIRE!" because it's confused. Result: The doctor gets called for no reason, wasting time.

Because the AI was so confused about which conditions it was bad at (especially the rare ones), the "safety net" didn't work.

  • The AI would refuse to diagnose the rare, dangerous diseases (thinking it was unsure), leaving the doctor to catch them.
  • But it would also confidently misdiagnose common diseases, or refuse to answer on common cases it would have gotten right.

The Result: When the AI started "refusing" to answer, the overall performance of the system actually got worse, not better. The safety net was full of holes.
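
The standard way to measure whether the net helps is the risk-coverage trade-off: as the model hands over more of its least-confident cases, the error rate on the cases it keeps should fall. A minimal sketch of that check, assuming we have a confidence score and a right/wrong flag for every prediction (names are illustrative):

```python
import numpy as np

def risk_at_coverage(confidences, errors, coverage):
    """Error rate on the `coverage` fraction of most-confident predictions.

    errors: 1 if the prediction was wrong, 0 if it was right, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(-confidences)  # most confident first
    n_keep = max(1, int(round(coverage * len(errors))))
    return errors[order[:n_keep]].mean()
```

With a working safety net, risk drops as coverage shrinks. A broken net shows up as risk staying flat, or even rising, as the model abstains more, which is the pattern this study reports.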

Why Did This Happen?

The problem was Class-Dependent Miscalibration.

Think of the 25 diseases as different types of fruit in a basket.

  • Common fruits (Apples): There are thousands of them. The AI has seen them a million times. It knows them well.
  • Rare fruits (Jackfruit): There are only a few. The AI has barely seen them.

The AI was great at identifying Apples. But when it saw a Jackfruit, it got confused.

  • Sometimes it thought the Jackfruit was an Apple (and was very confident it was right).
  • Sometimes it thought the Jackfruit was a mystery object (and was very unsure).

Because the AI couldn't tell the difference between "I know this rare fruit" and "I'm guessing," the safety mechanism (Selective Prediction) failed. It couldn't tell the doctor which rare cases needed human help.
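
Because the failure is class-dependent, the calibration check has to be run per condition rather than in aggregate; a single overall number can look healthy while the rare classes are badly off. A minimal sketch, reusing the `expected_calibration_error` function from the earlier sketch (the multi-label layout and names are assumptions):

```python
import numpy as np

def per_class_ece(probs, labels, n_bins=10):
    """One calibration score per condition, so the "Jackfruits" can't hide.

    probs, labels: arrays of shape (N, 25) for 25 conditions; labels are 0/1.
    """
    scores = {}
    for c in range(probs.shape[1]):
        conf = np.maximum(probs[:, c], 1.0 - probs[:, c])
        correct = ((probs[:, c] >= 0.5) == labels[:, c].astype(bool)).astype(int)
        scores[c] = expected_calibration_error(conf, correct, n_bins)
    return scores  # compare the rare conditions against the common ones
```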

Did They Fix It?

The researchers tried a simple trick: Loss Upweighting.
This is like telling the AI: "Hey, you keep messing up the Jackfruits. Every time you get a Jackfruit wrong, I'm going to give you a double penalty. Try harder!"
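
In code, loss upweighting is a one-line change to the training loss. A minimal PyTorch sketch for a 25-condition multi-label setup; the weight values and rare-condition indices below are hypothetical placeholders, and the paper's exact scheme may differ:

```python
import torch
import torch.nn as nn

# "Double penalty for Jackfruits": rare conditions get a larger weight,
# so mistakes on them cost more during training.
num_conditions = 25
pos_weight = torch.ones(num_conditions)
pos_weight[[7, 19]] = 2.0  # pretend conditions 7 and 19 are the rare ones

# pos_weight scales the loss on positive examples of each condition
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, num_conditions)                     # a batch of 8 patients
targets = torch.randint(0, 2, (8, num_conditions)).float()  # ground-truth 0/1 labels
loss = loss_fn(logits, targets)  # rare-condition mistakes now count double
```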

  • Did it help? A little bit. The AI got slightly better at knowing when it was unsure about the rare fruits.
  • Did it fix the problem? No. The safety net was still broken. The AI still couldn't reliably decide when to ask for help.

The Big Takeaway

This paper is a warning sign for the future of AI in healthcare.

  1. High Scores Lie: Just because an AI has a high "accuracy score" (like getting 90% of a test right) doesn't mean it's safe to use in a hospital.
  2. Confidence is Key: For AI to be safe, it needs to know when it doesn't know. Currently, our best AI models are bad at this, especially for rare diseases.
  3. The "Fail-Safe" is Broken: We can't just rely on the AI to say, "I'm not sure, you check it." Right now, the AI is too confused to know when to say that.

In short: We are building very smart AI doctors, but they are like overconfident interns who think they know everything. Until we teach them to be humble and know when to ask for help, we can't fully trust them to keep patients safe.
