This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you have a super-smart AI doctor that can look at medical images (like eye scans or tissue slides) and instantly guess what's wrong with a patient. It's incredibly fast and usually very accurate. But here's the problem: AI isn't perfect. Sometimes it's confident but wrong. If a doctor blindly trusts every single guess this AI makes, they might treat healthy people for diseases they don't have, or miss serious conditions in others.
This paper introduces a new "safety layer" called StratCP (Stratified Conformal Prediction). Think of StratCP not as a new doctor, but as a strict, safety-conscious gatekeeper standing between the AI and the real-world doctor.
Here is how it works, using simple analogies:
1. The Problem: The Over-Confident AI
Imagine the AI is a student taking a test. It gets 90% of the answers right on average. But for the hard questions, it starts guessing wildly.
- Old Way: The teacher (the doctor) takes every answer the student writes down and puts it in the grade book. If the student guesses "Cancer" for a healthy person, the patient gets scared and gets unnecessary tests.
- The Risk: The AI's "average" score looks great, but the mistakes are concentrated on the most dangerous cases.
2. The Solution: The "Gatekeeper" (StratCP)
StratCP changes the game by sorting every patient into one of two groups: the "Go" Group and the "Wait" Group.
🟢 The "Go" Group (Action Arm)
For some patients, the AI is so confident and the data is so clear that StratCP says, "This is safe to act on."
- The Analogy: Think of this like a security checkpoint. The AI says, "I'm 99% sure this is a benign mole." StratCP checks the math and says, "Okay, based on our strict safety rules, we are allowed to make a mistake only 5 times out of 100. Since this case is super clear, we can let it pass."
- The Result: The doctor can treat the patient immediately without needing more expensive or invasive tests. StratCP guarantees that, across all the patients it waves through, the error rate stays below the agreed-upon limit (e.g., 5%).
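To make the gatekeeper idea concrete, here is a minimal, hypothetical sketch of calibrating a "safe to act" confidence bar on held-out data. Everything here (variable names, the toy data, the threshold scan) is invented for illustration and is simpler than the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration set: the model's top-class confidence for 1000
# past cases, plus whether its top guess was actually right (toy data where
# accuracy roughly tracks confidence).
top_prob = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.uniform(size=1000) < top_prob).astype(int)

alpha = 0.05  # agreed-upon limit: at most ~5% errors in the "Go" group

# Scan confidence bars from strictest to loosest; keep the loosest bar whose
# calibration error rate still stays at or under alpha.
threshold = 1.0
for t in np.sort(top_prob)[::-1]:
    acted = top_prob >= t                     # cases the gatekeeper would pass
    error_rate = 1 - correct[acted].mean()    # how often acting would be wrong
    if error_rate <= alpha:
        threshold = t                         # still safe: loosen the bar
    else:
        break                                 # too risky: stop here

def route(prob_of_top_class: float) -> str:
    """Gatekeeper: act only when confidence clears the calibrated bar."""
    return "Go" if prob_of_top_class >= threshold else "Wait"
```

The key design point is that the bar is not hand-picked; it is read off from data the model has never trained on, which is what lets you make a statistical promise about the error rate in the "Go" group.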
🟡 The "Wait" Group (Deferral Arm)
For other patients, the AI is confused. Maybe the image is blurry, or the disease looks like something else.
- The Analogy: StratCP says, "Stop! Don't guess yet." Instead of forcing a single answer, it hands the doctor a shortlist of possibilities.
- The Magic: It says, "We aren't sure if it's Disease A or Disease B, but we are 95% sure it's one of these two."
- The Benefit: This tells the doctor exactly what to do next: "Go do a specific blood test to check for A or B." It prevents the doctor from guessing blindly and wasting resources.
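The "shortlist" above is what conformal prediction calls a prediction set, and the standard split-conformal recipe for building one is short enough to sketch. All names and numbers below are hypothetical; the paper's scoring function may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical calibration data: the probability the model assigned to the
# TRUE diagnosis for 500 past cases (invented toy numbers).
true_label_prob = rng.uniform(0.2, 1.0, size=500)

alpha = 0.05  # we want the shortlist to contain the truth ~95% of the time

# Split-conformal recipe: score = 1 - prob(true label); the cutoff is a
# slightly inflated quantile of the calibration scores (finite-sample fix).
scores = 1 - true_label_prob
n = len(scores)
level = np.ceil((n + 1) * (1 - alpha)) / n
qhat = np.quantile(scores, level, method="higher")

def shortlist(probs: dict[str, float]) -> list[str]:
    """Keep every label the model cannot safely rule out."""
    return [lab for lab, p in probs.items() if 1 - p <= qhat]

# A confused case: two plausible diseases, so the doctor gets both,
# while the clearly implausible option is ruled out.
print(shortlist({"Disease A": 0.5, "Disease B": 0.45, "Healthy": 0.05}))
```

The payoff is the guarantee: as long as calibration and new cases are exchangeable, the true diagnosis lands inside the shortlist at least 95% of the time, no matter how confused the model is.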
3. The "Smart Organizer" (Utility Graph)
Sometimes, the AI's shortlist is messy. It might say, "It could be a broken toe OR a heart attack." These are totally different things requiring different actions.
StratCP has a special feature called a Utility Graph.
- The Analogy: Imagine the AI is a chaotic librarian who throws books at you. StratCP is the smart librarian who reorganizes the pile.
- If the AI is unsure between "Mild Diabetes" and "Severe Diabetes," StratCP groups them together because the treatment is similar (monitoring).
- If the AI is unsure between "Diabetes" and "A Broken Leg," StratCP realizes these are too different and might suggest a different set of tests.
- Why it matters: It makes the "Wait" list actually useful for a human doctor, grouping similar conditions so the next step is clear.
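The "smart librarian" step can be sketched as a tiny lookup-and-group routine. The diagnosis names and actions below are invented placeholders, and a real utility graph would be richer than a flat dictionary:

```python
# Hypothetical utility graph, flattened to "each diagnosis implies one next
# clinical action." Grouping a messy shortlist by action yields a clear plan.
ACTION = {
    "Mild Diabetes": "monitor glucose",
    "Severe Diabetes": "monitor glucose",
    "Disease A": "order blood test",
    "Disease B": "order blood test",
    "Broken Leg": "order X-ray",
}

def organize(shortlist: list[str]) -> dict[str, list[str]]:
    """Group uncertain diagnoses by the next action they imply."""
    plan: dict[str, list[str]] = {}
    for diagnosis in shortlist:
        plan.setdefault(ACTION[diagnosis], []).append(diagnosis)
    return plan

print(organize(["Mild Diabetes", "Severe Diabetes", "Broken Leg"]))
# {'monitor glucose': ['Mild Diabetes', 'Severe Diabetes'], 'order X-ray': ['Broken Leg']}
```

Notice how the two diabetes labels collapse into one line of the plan: the doctor no longer needs the AI to distinguish them before taking the next step.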
4. Real-World Impact: Saving Time and Money
The paper tested this on two big medical areas: Eye Disease and Brain Tumors.
- In Eye Disease: StratCP helped doctors decide which eye scans could be treated immediately and which needed a second look. It found more "safe to treat" cases than other methods without making more mistakes.
- In Brain Tumors: This is a big deal. Usually, if a pathologist sees a tumor, they have to send it to a lab for expensive molecular testing (like DNA sequencing) to know exactly what kind it is. This takes weeks and costs money.
- StratCP's Win: For many clear-cut cases, StratCP said, "We are confident enough based on the image alone. No need for the expensive lab test."
- The Result: They estimated this could save the US healthcare system $12.5 million a year and cut diagnosis time by weeks, while keeping the error rate safely low.
Summary
StratCP is the bridge between "AI is smart" and "AI is safe to use in a hospital."
It stops the AI from being a "know-it-all" that guesses on everything. Instead, it acts as a risk manager:
- It acts when it's safe (saving time and money).
- It defers when it's unsure (preventing harm).
- It organizes the uncertainty so doctors know exactly what to do next.
It turns a black-box AI prediction into a clear, safe, and actionable medical decision.