Conformal Prediction with Corrupted Labels: Uncertain Imputation and Robust Re-weighting

This paper proposes a robust framework for conformal prediction under label corruption. It analyzes the resilience of privileged conformal prediction to weight-estimation errors and introduces a novel uncertain imputation method; both components can be integrated into a triply robust system that ensures valid uncertainty quantification.

Shai Feldman, Stephen Bates, Yaniv Romano

Published 2026-02-27

The Big Picture: Predicting the Future with Broken Data

Imagine you are a weather forecaster. Your job is to predict tomorrow's weather and give people a "confidence range" (e.g., "It will be between 60°F and 70°F"). You want to be right 90% of the time.
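The "confidence range with 90% coverage" idea is exactly what split conformal prediction delivers on clean data. Here is a minimal sketch on toy data; the linear model, noise levels, and all variable names are assumptions for illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy calibration data: y = 2x + noise. Pretend f(x) = 2x was fit
# on a separate training split.
x_cal = rng.uniform(0, 10, 500)
y_cal = 2 * x_cal + rng.normal(0, 1, 500)

def predict(x):
    return 2 * x

# Split conformal: calibrate an interval half-width on held-out residuals.
alpha = 0.1                                        # target 90% coverage
scores = np.abs(y_cal - predict(x_cal))            # nonconformity scores
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))  # conformal quantile rank
q = np.sort(scores)[k - 1]

# Prediction interval for a new point: [f(x) - q, f(x) + q].
x_new = 5.0
interval = (predict(x_new) - q, predict(x_new) + q)
```

On fresh data drawn from the same distribution, intervals of this width cover the true label about 90% of the time; the paper's question is what happens when the calibration labels themselves are corrupted.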

Usually, you learn from past data. But in this paper, the authors are dealing with a messy situation: The past data is corrupted. Some of the historical weather records are missing, and some have the wrong temperature written down.

If you try to make predictions using this broken data, your confidence ranges will be wrong. You might say "60–70°F" when it's actually going to be a blizzard at 20°F. This is dangerous in high-stakes fields like medicine or finance.

The authors propose three new ways to fix this broken data so your predictions remain reliable.


The Problem: The "Missing Ingredient" Dilemma

To understand their solution, we need to introduce a special ingredient called Privileged Information (PI).

  • The Scenario: Imagine you are training a doctor to diagnose a disease.
  • The Data: You have patient records (symptoms, age, etc.) and the diagnosis.
  • The Corruption: Some records are missing the diagnosis.
  • The Privileged Information (PI): During training, the doctor had access to a secret, high-tech MRI scan that perfectly predicted the disease. But at the time of testing (when a new patient comes in), that MRI scan is unavailable (maybe it's too expensive or the patient forgot to bring it).

The challenge is: How do you use that secret MRI scan to fix the missing diagnoses in your training data, even though you can't see the MRI scan for new patients?


Solution 1: The "Weighted Scale" (Privileged Conformal Prediction - PCP)

The Analogy: Imagine you are weighing apples to estimate the average weight of a bushel. But, some apples are rotten (corrupted labels). You also know that the rotten apples came mostly from a specific tree (the Privileged Information).

How it works:
The authors suggest using a weighted scale. You give "more weight" to the healthy apples and "less weight" to the rotten ones to balance the scale.

  • The Catch: To do this perfectly, you need to know exactly how likely an apple is to be rotten based on which tree it came from.
  • The Paper's Discovery: The authors found that even if you guess the weights wrong (your scale is slightly off), the prediction might still be okay! It's like if you guess the apple is 10% rotten when it's actually 15% rotten; your final average might still be close enough to be safe.
  • The Limitation: If your guess is too wrong, the scale tips, and your prediction fails.
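The weighted scale above amounts to taking a *weighted* quantile of the nonconformity scores, with inverse-probability weights derived from the privileged information. A toy sketch of that idea follows; the corruption model, the perturbed weight estimates, and all names are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Toy setup: privileged info z drives the chance that a label is clean.
z = rng.uniform(0, 1, n)
p_clean = 0.5 + 0.5 * z                 # true P(label clean | z)
clean = rng.uniform(0, 1, n) < p_clean

# Nonconformity scores are observable only for the clean points.
scores = np.abs(rng.normal(0, 1, n))[clean]

# Inverse-probability weights: points unlikely to survive count more.
# In practice p_clean must be estimated; the perturbation below mimics
# the estimation error the paper shows this approach can tolerate.
p_est = np.clip(p_clean[clean] + rng.normal(0, 0.05, clean.sum()), 0.05, 1.0)
w = 1.0 / p_est

def weighted_quantile(s, w, alpha):
    # Smallest score whose cumulative normalized weight reaches 1 - alpha.
    order = np.argsort(s)
    cw = np.cumsum(w[order]) / w.sum()
    return s[order][np.searchsorted(cw, 1 - alpha)]

q = weighted_quantile(scores, w, alpha=0.1)
```

The point of the paper's analysis is that `q` degrades gracefully when `p_est` is slightly off, but can fail if the weight estimates are badly wrong.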

Solution 2: The "Guess-and-Check" with a Safety Net (Uncertain Imputation - UI)

The Analogy: Instead of trying to weigh the rotten apples, imagine you just replace the rotten apples with a "fake" apple that looks like a healthy one, but you add a giant, fuzzy cloud around it to represent uncertainty.

How it works:

  1. The Guess: You use the Privileged Information (the secret MRI) to guess what the missing diagnosis should have been.
  2. The Safety Net: You don't just write down the guess. You say, "I think the diagnosis is X, but I'm not 100% sure." So, you add a "cloud of error" around X. This cloud represents all the possible things the diagnosis could be.
  3. The Result: When you make your final prediction, you include this whole cloud. Because you made the cloud big enough to cover all possibilities, your prediction is guaranteed to be correct, even if your initial guess was slightly off.

Why it's cool: This method doesn't need to know the exact "weights" of the corruption. It just needs to be good at guessing the answer using the secret info, and then admitting it might be wrong by adding a safety buffer.
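The guess-plus-cloud step can be sketched as imputing each missing label from the privileged information and then adding noise drawn from the imputer's own residuals, so the imputed labels carry the imputer's uncertainty instead of a single point guess. Everything below (the proxy `z`, the 30% missingness, the model `f(x) = 2x`) is a toy assumption, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Toy data: z = y + small noise is a near-perfect privileged proxy,
# available at calibration time but not at test time.
x = rng.uniform(0, 10, n)
y = 2 * x + rng.normal(0, 1, n)
z = y + rng.normal(0, 0.3, n)
missing = rng.uniform(0, 1, n) < 0.3    # 30% of labels are corrupted

# Uncertain imputation (sketch): guess y from z, then add a residual
# sampled from the imputer's errors on the clean points -- the "cloud".
resid = y[~missing] - z[~missing]       # imputer errors, seen on clean data
y_imp = y.copy()
y_imp[missing] = z[missing] + rng.choice(resid, missing.sum())

# Calibrate as usual on the partly imputed labels.
alpha = 0.1
scores = np.abs(y_imp - 2 * x)          # assume the model f(x) = 2x
k = int(np.ceil((n + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]
```

Because the imputed labels are sampled rather than point estimates, the calibrated width `q` inflates just enough to absorb the imputer's uncertainty.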

Solution 3: The "Triple-Redundancy" System (Triply Robust)

The Analogy: Imagine you are building a bridge. You want it to be safe even if one of your support beams breaks.

  • Beam 1: A standard prediction (works if the data is perfect).
  • Beam 2: The "Weighted Scale" (works if you can estimate the corruption rates).
  • Beam 3: The "Guess-and-Check with Cloud" (works if you can guess the answer using the secret info).

How it works:
The authors combine all three methods into one super-method called TriplyRobust.

  • They take the prediction from Beam 1, Beam 2, and Beam 3.
  • They combine them into one giant prediction set (the union of all three).
  • The Magic: As long as at least one of these three methods is doing its job correctly, the final result is guaranteed to be safe. It's like having three different navigators; even if two get lost, as long as one knows the way, you arrive safely.
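For interval-valued predictions, the union step can be as simple as taking the smallest interval containing all three candidates; if any one component interval has valid coverage, so does the union. The three example intervals below are made up for illustration.

```python
def union_interval(intervals):
    """Smallest interval containing every candidate interval.

    A conservative stand-in for the union of the three prediction sets.
    """
    lo = min(lo for lo, _ in intervals)
    hi = max(hi for _, hi in intervals)
    return lo, hi

# Hypothetical intervals from the three methods for one test point:
standard = (60.0, 70.0)   # Beam 1: plain conformal
weighted = (58.0, 69.0)   # Beam 2: PCP (weighted scale)
imputed  = (61.0, 72.0)   # Beam 3: UI (guess + cloud)

print(union_interval([standard, weighted, imputed]))  # → (58.0, 72.0)
```

The price of this robustness is width: the union is at least as wide as its widest valid component, which is exactly the "gets a little wider and safer" trade-off described below.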

Why This Matters

In the real world, data is rarely perfect.

  • Medical records often have missing diagnoses.
  • Financial data might have errors or hidden biases.
  • AI training often relies on data that was labeled by humans who made mistakes.

This paper gives us a toolkit to build AI that says, "I'm not 100% sure because the data is messy, but here is a range that I promise (with 90% certainty) contains the truth."

Summary of the "Magic"

  1. PCP: Tries to fix the data by weighing the good parts more than the bad parts. It's surprisingly sturdy even if the weights aren't perfect.
  2. UI: Replaces bad data with a "best guess" but wraps it in a giant safety blanket of uncertainty.
  3. TriplyRobust: Combines everything. If any one of the three strategies works, the whole system works.

It's like having a backup plan for your backup plan, ensuring that even when the data is broken, your AI doesn't crash—it just gets a little wider and safer.
