Imagine you have a very smart but sometimes overconfident friend who loves to give advice. Sometimes, their advice is spot-on and saves you hours of work. Other times, they are guessing wildly, and following their advice could cost you money or get you into trouble.
The big question is: How do you know when to listen to them and when to say, "No, I'll figure this out myself"?
This paper introduces a new system called SCoRE (Selective Conformal Risk control with E-values) to solve exactly that problem. It's a "trust filter" for Artificial Intelligence that works even when the AI is a "black box" (we don't know how it thinks) and the risks are complex (not just "right" or "wrong," but "a little expensive" or "very dangerous").
Here is how SCoRE works, broken down into simple concepts:
1. The Problem: The "Guessing Game"
In the past, if an AI model was unsure, it might just refuse to answer. But that's too simple.
- Scenario A (Drug Discovery): An AI suggests a new drug. If it's right, we save millions. If it's wrong, we waste money on lab tests. The "risk" here is the cost of the wasted money.
- Scenario B (Hospital Care): An AI predicts how long a patient will stay in the ICU. If it's wrong, the hospital might overbook beds or under-staff. The "risk" here is the squared error (how far off the prediction was).
The challenge is that these risks aren't just "Yes/No." They are continuous numbers (dollars, days, errors). We need a way to say, "I will trust the AI on this specific case, but only if I can guarantee the average cost of my mistakes stays below a certain limit."
2. The Solution: The "E-Value" Ticket
The authors use a clever statistical tool called an E-value. Think of an E-value as a ticket the AI has to earn to get into the "Trusted Zone."
- The Rule: To earn the ticket, the AI must pile up enough evidence that the "price" of being wrong (the risk) is low enough.
- The Math Magic: The paper derives a formula that measures this evidence from data the AI has already been graded on (calibration data).
- The Guarantee: If the E-value clears a preset bar, then with high, precisely quantified probability, the average cost of our mistakes won't exceed our budget.
It's like a casino with the roles reversed: the AI is the gambler, and we are the house. The AI keeps betting that its risk is below the limit, and the E-value tracks its winnings. Only when those winnings cross the bar do we conclude, "The odds really are in our favor; if we play along only on these bets, our average losses stay under $100."
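A toy version of this ticket check, to make the idea concrete. This is a generic textbook "betting" e-value, not the paper's exact SCoRE construction, and the losses, budget `alpha`, error level `delta`, and bet size `lam` below are all made up for illustration:

```python
# Toy "E-value ticket" check. Losses are assumed to lie in [0, 1].
# This is a generic betting-style e-value, NOT the paper's exact formula.

def e_value(losses, alpha, lam):
    """Product e-value against the null hypothesis "true risk >= alpha".

    Each factor 1 + lam * (alpha - loss) has expectation <= 1 under the
    null (since then E[alpha - loss] <= 0) and stays nonnegative for
    losses in [0, 1] whenever 0 <= lam <= 1 / (1 - alpha).
    """
    assert 0.0 < alpha < 1.0 and 0.0 <= lam <= 1.0 / (1.0 - alpha)
    e = 1.0
    for loss in losses:
        e *= 1.0 + lam * (alpha - loss)
    return e

def buys_ticket(losses, alpha, delta, lam):
    """By Markov's inequality, P(e-value >= 1/delta) <= delta when the
    risk really is >= alpha, so clearing the bar certifies
    "risk < alpha" with confidence 1 - delta."""
    return e_value(losses, alpha, lam) >= 1.0 / delta

# Calibration losses the AI has already been graded on (made-up numbers).
calib = [0.05, 0.1, 0.0, 0.2, 0.05, 0.1, 0.0, 0.15]
print(buys_ticket(calib, alpha=0.5, delta=0.1, lam=1.0))  # -> True
```

The product form is why the betting picture fits: genuine evidence compounds multiplicatively, so an AI whose losses are consistently small clears the bar (1/delta) quickly, while one with large losses sees its winnings shrink toward zero.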
3. Two Ways to Measure "Safety"
The paper introduces two different ways to set your safety budget, depending on your goal:
MDR (Marginal Deployment Risk): The "Total Budget" Approach
- Analogy: Imagine you have a $1,000 wallet for the whole day.
- Goal: You can make as many mistakes as you want, as long as the total money you lose by the end of the day doesn't exceed $1,000.
- Best for: Situations where you have a fixed budget and don't care if you make a few big mistakes, as long as the total damage is contained. (e.g., "We have $1M for drug trials; we can't spend more than that.")
SDR (Selective Deployment Risk): The "Average Cost" Approach
- Analogy: Imagine you are a quality control inspector. You can reject as many bad products as you want, but for the ones you do ship, the average number of defects must be very low.
- Goal: You want to ensure that every single time you trust the AI, the risk is low. You care about the average quality of your trusted decisions.
- Best for: Situations where you can't afford any bad outcomes to slip through, or where the cost scales with the number of decisions. (e.g., "Every time we release a medical report, it must be highly accurate.")
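The difference between the two budgets comes down to what you divide the total cost by. A made-up toy calculation (illustrative numbers, not from the paper's experiments):

```python
# Toy illustration of the two risk notions (all numbers are made up).
# Each case has a loss (cost if we trust the AI) and a trust decision.
losses = [0.2, 0.9, 0.1, 0.0, 0.8, 0.1]
trust  = [True, False, True, True, False, True]

deployed = [loss for loss, t in zip(losses, trust) if t]

# MDR-style "total budget": cost of trusted cases spread over ALL cases,
# so every abstention dilutes the average.
mdr = sum(deployed) / len(losses)

# SDR-style "average cost": average cost among only the cases we
# actually trusted; abstaining does not dilute it.
sdr = sum(deployed) / len(deployed)

print(mdr, sdr)  # mdr < sdr here: 0.4 / 6 vs 0.4 / 4
```

Rejecting the two risky cases (losses 0.9 and 0.8) keeps both numbers low, but SDR is the stricter yardstick: it judges only the decisions you actually shipped.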
4. How It Works in Real Life (The Examples)
The paper tested SCoRE in three real-world scenarios:
Finding New Drugs:
- The AI: Predicts if a chemical will bind to a virus.
- The Risk: If the AI is wrong, we waste money on lab tests.
- SCoRE's Job: It filters out the chemicals that look promising but have a high chance of being expensive failures. It ensures the average cost of the chemicals we actually test stays low.
Hospital ICU Predictions:
- The AI: Predicts how many days a patient stays in the ICU.
- The Risk: If the prediction is off by 5 days, the hospital planning goes haywire.
- SCoRE's Job: It only trusts the AI's prediction when it's very confident. If the AI is unsure, SCoRE says "No," and a human doctor takes over. This keeps the total error in hospital planning low.
AI Radiology Reports:
- The AI: Writes a report about an X-ray.
- The Risk: If the AI misses a tumor or invents a fake one, it's dangerous.
- SCoRE's Job: It checks the AI's report against a "confidence score." If the AI is confident and the risk of a semantic error (meaning the report makes sense but is wrong) is low, it lets the report go to the doctor. If not, it flags it for human review.
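All three examples share the same routing skeleton: compare a confidence score against a calibrated bar, then either deploy the AI's output or hand the case to a human. A hypothetical sketch (in SCoRE the threshold would come out of the e-value calibration; here it is just a made-up constant):

```python
# Hypothetical gatekeeper sketch shared by all three examples.
# The 0.8 threshold is a placeholder, not a calibrated value.

def route(confidence, threshold=0.8):
    """Deploy confident predictions; flag the rest for human review."""
    return "deploy AI output" if confidence >= threshold else "human review"

for conf in [0.95, 0.60, 0.85]:
    print(conf, "->", route(conf))
```

The value of SCoRE is not this if-statement, which is trivial, but the guarantee attached to the threshold: it is set so that the cases routed to "deploy" provably keep the chosen risk (MDR or SDR) within budget.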
5. Why This is a Big Deal
Before this paper, most AI safety tools were like a simple on/off switch: "Safe" or "Unsafe." They couldn't handle the nuance of "This is risky, but maybe worth it if the reward is high."
SCoRE is like a dimmer switch. It allows us to:
- Be flexible: We can choose to be very strict (low risk) or a bit more relaxed (higher risk, higher reward).
- Be robust: It works even if the data changes (e.g., the AI sees patients from a different city than it was trained on).
- Be efficient: It doesn't just say "No" to everything. It finds the "sweet spot" where we can trust the AI enough to save time and money, without breaking the bank.
Summary
SCoRE is a smart gatekeeper. It doesn't just ask, "Is the AI right?" It asks, "Is the AI right enough given the cost of being wrong?" By using a special mathematical ticket system (E-values), it guarantees that if we follow its advice, then with high, quantifiable probability we won't spend more on mistakes than we budgeted. It turns AI from a reckless gambler into a disciplined partner.