Co-optimization for Adaptive Conformal Prediction

This paper proposes CoCP, a framework that jointly optimizes prediction-interval centers and radii through an alternating algorithm combining quantile regression with a differentiable soft-coverage objective. CoCP retains finite-sample marginal validity while significantly improving interval efficiency and conditional coverage under heteroscedasticity and skewness, compared to existing methods.

Xiaoyi Su, Zhixin Zhou, Rui Luo

Published 2026-03-03

Imagine you are trying to predict the weather for tomorrow. You want to give your friends a forecast that is reliable (it actually rains when you say it will) but also useful (you don't just say "it might rain or snow or be sunny," you want to give a specific range).

In the world of data science, this is called Conformal Prediction. It's a method to draw a "safety net" (a prediction interval) around a guess. If you say, "The temperature will be between 60°F and 80°F," you want to be 90% sure the real temperature falls in that range.
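The "safety net" idea can be made concrete with a minimal split conformal prediction sketch. Everything below is a toy illustration (made-up linear data, a least-squares fit as the model), not code from the paper; the 90% level corresponds to a miscoverage rate of alpha = 0.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends linearly on x, plus noise.
x = rng.uniform(0, 10, 2000)
y = 2.0 * x + rng.normal(0, 1.0, 2000)

# Split: fit the model on one half, calibrate the interval on the other.
x_tr, y_tr, x_cal, y_cal = x[:1000], y[:1000], x[1000:], y[1000:]
slope, intercept = np.polyfit(x_tr, y_tr, 1)

def predict(x):
    return slope * x + intercept

# Conformity scores: absolute residuals on the calibration half.
scores = np.abs(y_cal - predict(x_cal))

# Radius: the (1 - alpha) quantile of the scores, with a small
# finite-sample correction so "90% sure" holds for any distribution.
alpha = 0.10
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

# 90% prediction interval for a new point.
x0 = 5.0
lo, hi = predict(x0) - q, predict(x0) + q
```

The key property: no matter how bad the model is, the interval `[lo, hi]` covers the truth about 90% of the time on fresh data, because the radius is calibrated on held-out residuals.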

The Problem: The "Equal-Tailed" Mistake

Most current methods use a simple, rigid rule to draw this safety net. They assume the weather is symmetrical, like a bell curve. They say, "Okay, I'll cut off the bottom 5% of possibilities and the top 5%."

The Analogy: Imagine you are trying to fit a suitcase into a car trunk.

  • The Old Way (CQR): You assume the trunk is a perfect rectangle. You measure 5 inches from the left wall and 5 inches from the right wall, then close the lid.
  • The Reality: The trunk is actually shaped weirdly. Maybe the left side is deep and full of space, but the right side is squished by the spare tire.
  • The Result: By measuring equally from both sides, your suitcase (the prediction interval) ends up being way too big because you're including a lot of empty space on the left just to balance the tight squeeze on the right. You are safe, but you are wasting space.

This happens when data is "skewed" (lopsided). The old methods pin the interval's endpoints to fixed equal-tailed quantiles, even if the "crowded" part of the data (where the truth is most likely to be) is off to one side.
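The cost of the equal-tailed rule is easy to see numerically. The sketch below is a toy experiment (a lognormal sample standing in for skewed residuals; not data from the paper): it compares the equal-tailed 90% interval with the shortest interval that holds the same 90% of the sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Heavily right-skewed sample: most mass near zero, a long thin tail.
r = rng.lognormal(mean=0.0, sigma=1.0, size=20000)

alpha = 0.10

# Equal-tailed 90% interval: cut 5% from each side.
lo_et, hi_et = np.quantile(r, [alpha / 2, 1 - alpha / 2])
width_et = hi_et - lo_et

# Shortest 90% interval: slide a window over the sorted sample that
# always contains 90% of the points, and keep the narrowest one.
r_sorted = np.sort(r)
k = int(np.ceil((1 - alpha) * len(r)))
widths = r_sorted[k - 1:] - r_sorted[:len(r) - k + 1]
i = np.argmin(widths)
lo_sh, hi_sh = r_sorted[i], r_sorted[i + k - 1]
width_sh = hi_sh - lo_sh
```

Both intervals trap 90% of the points, but on skewed data the shortest one is markedly narrower: the equal-tailed version pays for "empty space" in the thin tail, exactly the oversized-suitcase effect described above.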

The Solution: CoCP (The "Smart Suitcase")

The authors of this paper propose a new method called CoCP (Co-optimization for Adaptive Conformal Prediction). Instead of using a rigid ruler, CoCP acts like a smart, shape-shifting suitcase that learns the exact shape of the trunk.

Here is how it works, using a simple metaphor:

1. The "Folded Flag" Trick

Imagine the data distribution is a flag hanging on a pole.

  • The Old Way: You try to grab the flag from the left and right edges equally.
  • CoCP's Way: It takes the flag, folds it in half over the pole (the center), and looks at the combined thickness. It realizes, "Hey, the left side of the flag is much thicker (denser) than the right side."
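In score terms, "folding" means measuring each point by its absolute distance from a chosen center, then taking a high quantile of those folded distances as the radius. The sketch below is illustrative (toy lognormal data, a simple finite-sample correction, a brute-force grid over centers rather than the paper's learned center); it shows that the radius you need depends heavily on where you fold:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.lognormal(0.0, 1.0, 5000)   # a skewed calibration sample

def folded_radius(y, center, alpha=0.10):
    """Fold the sample around `center` and return the (1 - alpha)
    quantile of the folded (absolute) deviations."""
    scores = np.abs(y - center)
    n = len(scores)
    return np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))

# Sweep candidate centers and keep the one needing the smallest radius.
centers = np.linspace(np.quantile(y, 0.05), np.quantile(y, 0.95), 50)
radii = np.array([folded_radius(y, c) for c in centers])
best_c = centers[np.argmin(radii)]
```

Folding around the best center yields a noticeably smaller 90% radius than folding around the median or the mean, which is the whole point of learning the center rather than fixing it.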

2. The "Push and Pull" Dance

CoCP doesn't just guess the center; it learns it through a two-step dance:

  • Step A (The Radius): It asks, "How wide do I need to be to catch 90% of the flag?" It measures the folded flag.
  • Step B (The Center): It looks at the edges of its current guess. If the left edge is in a "thick" part of the flag and the right edge is in a "thin" part, CoCP says, "I'm off-center! I need to push my center toward the thick part."

Why? Because if you move the center toward the thick part, you can shrink the width of the suitcase while still catching the same 90% of the flag. You are squeezing out the empty space.
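The push-and-pull dance can be caricatured as an alternating loop. This is a deliberately crude toy, not the paper's algorithm: the density-at-the-edge estimate, the window width `tau`, the step size, and the iteration count are all made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.lognormal(0.0, 1.0, 5000)   # skewed calibration sample
alpha, tau, step = 0.10, 0.25, 0.5  # miscoverage, edge window, step size

c0 = np.median(y)
r0 = np.quantile(np.abs(y - c0), 1 - alpha)  # radius if the center never moves

c = c0
for _ in range(500):
    # Step A (the radius): fold around the current center and take the
    # 90% quantile (finite-sample correction omitted for clarity).
    r = np.quantile(np.abs(y - c), 1 - alpha)
    # Step B (the center): crude estimate of how many calibration
    # points sit near each edge of [c - r, c + r] ...
    left = np.mean(np.abs(y - (c - r)) < tau)
    right = np.mean(np.abs(y - (c + r)) < tau)
    # ... then push the center toward the denser ("thicker") edge, so
    # the next radius update can shrink the interval.
    c += step * (right - left)
```

On this sample the starting left edge sits in empty space below zero, so the loop pushes the center toward the dense side until the two edges balance, and the final radius `r` comes out smaller than the never-moved radius `r0` while still trapping 90% of the points.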

3. The "Soft Touch"

How does it know where the "thick" part is without seeing the whole flag? It uses a "soft window." Imagine a flashlight that only shines brightly on the very edges of your suitcase. If the light hits a dense crowd of people on the left edge, the flashlight pushes the suitcase to the right. If it hits a sparse crowd on the right, it pulls the suitcase to the left. It's a gentle, continuous nudge until the suitcase is perfectly centered on the crowd.
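One way to realize the "flashlight" is to replace the hard in/out indicator with a sigmoid ramp: the sigmoid's derivative is a bump that is largest where a point sits right at an edge of the interval. The sketch below is an assumption-laden illustration (the toy data, the temperature `tau`, and the function names are mine, not the paper's specification):

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.lognormal(0.0, 1.0, 5000)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_coverage(y, c, r, tau=0.25):
    """Smooth stand-in for mean(|y - c| <= r): the hard 0/1 membership
    is replaced by a sigmoid ramp of width ~tau at the edges."""
    return np.mean(sigmoid((r - np.abs(y - c)) / tau))

def soft_coverage_grad(y, c, r, tau=0.25):
    """d soft_coverage / dc. The sigmoid's derivative peaks where
    |y - c| is close to r, so only points near the two edges get
    weight -- the flashlight shining on the suitcase's edges."""
    z = (r - np.abs(y - c)) / tau
    w = sigmoid(z) * (1.0 - sigmoid(z))   # edge-localized weights
    return np.mean(w * np.sign(y - c)) / tau

c, r = np.median(y), 2.0
# A positive gradient means the right edge is denser than the left:
# the continuous nudge pushes the center toward that crowd.
g = soft_coverage_grad(y, c, r)
```

Because the weights `w` vanish far from the edges, the gradient is exactly a tug-of-war between the two edge crowds, which is what makes the nudge gentle and differentiable instead of a hard, discontinuous count.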

The Result: Shorter, Smarter Intervals

By doing this "co-optimization" (learning the center and the width at the same time), CoCP achieves two things:

  1. It stays safe: It still guarantees that 90% of the time, the truth is inside the box (just like the old methods).
  2. It gets tighter: Because it moves the box to the "high-density" area, the box doesn't need to be as wide.

In everyday terms:
If you are predicting house prices in a city where most houses are cheap, but a few are mansions:

  • Old Method: "The price will be between $100k and $1M." (Safe, but the $1M part is mostly empty space).
  • CoCP: "The price will be between $150k and $400k." (Still 90% safe, but much more useful because it focuses on where the houses actually are).

Why This Matters

This paper shows that by treating the prediction interval as a flexible object that can slide (translate) and stretch (scale) simultaneously, we can get much better predictions, especially when the data is messy, lopsided, or unpredictable. It's the difference between using a generic, one-size-fits-all box and a custom-molded container that fits the data perfectly.
