Imagine you are a weather forecaster. You don't just want to say, "It will rain tomorrow." You want to say, "It will rain between 2 and 4 inches."
Conformal Prediction is the tool that helps you draw that box (the 2 to 4 inches) so you can be 95% sure the actual rain will fall inside it. But here's the catch: if your box is too wide (e.g., "It will rain between 0 and 100 inches"), your prediction is technically correct, but useless. You want the box to be as tight as possible while still being safe. This "tightness" is called efficiency.
This paper asks a very practical question: How do we make that box as tight as possible, and how does the amount of data we have affect the size of that box?
Here is the breakdown of the paper's findings using simple analogies.
1. The Two Buckets of Data
To build a good prediction box, you need two types of data:
- The Training Bucket: This is where you teach the model how to predict. It learns the patterns.
- The Calibration Bucket: This is where you test the model to see how "nervous" it is. You look at its past mistakes to decide how wide the safety box should be.
The Big Question: If you have 1,000 data points total, should you put 900 in Training and 100 in Calibration? Or 500 and 500? Or 100 and 900?
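To make the two buckets concrete, here is a rough sketch of split conformal prediction on made-up data. The model, split sizes, and numbers are illustrative choices, not the paper's exact setup:

```python
# A minimal sketch of the "two buckets" (split conformal prediction).
# Synthetic data; the 500/500 split and the linear model are just examples.
import numpy as np

rng = np.random.default_rng(0)

# 1,000 points total: y = 2x + noise
x = rng.uniform(0, 10, 1000)
y = 2 * x + rng.normal(0, 1, 1000)

# Split into the two buckets (here 500/500)
x_train, y_train = x[:500], y[:500]
x_cal, y_cal = x[500:], y[500:]

# Training bucket: teach the model (a least-squares line)
slope, intercept = np.polyfit(x_train, y_train, 1)
predict = lambda v: slope * v + intercept

# Calibration bucket: measure the model's past mistakes
residuals = np.abs(y_cal - predict(x_cal))

# Box half-width for 95% coverage (alpha = 0.05):
# the ceil((1 - alpha) * (m + 1))-th smallest residual
alpha = 0.05
m = len(residuals)
k = int(np.ceil((1 - alpha) * (m + 1)))
q = np.sort(residuals)[k - 1]

# Prediction box for a new point
x_new = 5.0
lo, hi = predict(x_new) - q, predict(x_new) + q
print(f"box for x={x_new}: [{lo:.2f}, {hi:.2f}]")
```

Changing the 500/500 split to 900/100 changes both how good the point predictions are and how reliably the box width is estimated; that trade-off is exactly what the paper studies.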
2. The "Miscoverage Level" (The Safety Margin)
The paper introduces a variable called α (alpha). Think of this as your "safety margin."
- If α = 0.05, you want to be 95% sure the real value is in your box.
- If α = 0.001, you want to be 99.9% sure.
The Trap: Most people think, "The smaller the α, the safer I am." But the paper shows that if you make α too small (demanding near-perfect certainty), your prediction box explodes in size. It becomes so wide it's useless.
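The trap has a concrete mechanical cause. In split conformal prediction, the box half-width is the ceil((1 − α)(m + 1))-th smallest calibration mistake, where m is the calibration size. This toy arithmetic (standard split-conformal bookkeeping, not the paper's notation) shows what happens as α shrinks:

```python
# Why a tiny alpha "explodes" the box: with m calibration residuals,
# split conformal uses the ceil((1 - alpha) * (m + 1))-th smallest one
# as the box half-width. If that rank exceeds m, no finite residual
# works and the box must be infinite.
import math

def quantile_rank(alpha, m):
    """Rank of the calibration residual used as the box half-width."""
    return math.ceil((1 - alpha) * (m + 1))

m = 100  # calibration points
for alpha in [0.05, 0.01, 0.001]:
    k = quantile_rank(alpha, m)
    status = "finite box" if k <= m else "INFINITE box"
    print(f"alpha={alpha}: need rank {k} of {m} -> {status}")
```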
3. The "Phase Transition" (The Tipping Point)
The authors discovered a "tipping point" in how data affects the box size. Imagine you are trying to fill a bucket with water (data) to reach a specific height (accuracy).
Scenario A: You have plenty of data, and you aren't being too picky.
If you ask for a reasonable safety margin (e.g., 95% certainty), adding more training data makes your box shrink nicely. Adding more calibration data also helps. It's a smooth, predictable relationship.
- Analogy: It's like walking on a flat road. The more steps you take (data), the closer you get to your destination.
Scenario B: You are being extremely picky (tiny α).
If you demand 99.99% certainty, the math changes completely. Suddenly, the "Calibration Bucket" becomes the bottleneck. Even if you have millions of training examples, if you don't have enough calibration examples to prove you are that safe, your box stays huge.
- Analogy: It's like trying to cross a river. If you just need to get across, a small boat works. But if you need to be 100% sure you won't get wet, you need a massive, heavy-duty ship. If you don't have enough wood (calibration data) to build that ship, you can't cross, no matter how good your swimming lessons (training data) were.
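Turning the analogy into numbers: the split-conformal box can only be finite when ceil((1 − α)(m + 1)) ≤ m, which works out to needing at least roughly 1/α calibration points. A quick sketch of this generic lower bound (not a constant taken from the paper):

```python
# Minimum calibration size for a finite split-conformal box:
# ceil((1 - alpha) * (m + 1)) <= m  rearranges to  m >= 1/alpha - 1.
import math

def min_calibration_size(alpha):
    """Smallest m for which the conformal box can be finite."""
    return math.ceil(1 / alpha) - 1

for alpha in [0.05, 0.01, 0.001, 0.0001]:
    print(f"alpha={alpha}: need at least {min_calibration_size(alpha)} calibration points")
```

At α = 0.0001 (99.99% certainty) you need nearly 10,000 calibration points before a finite box is even possible; no amount of training data can substitute for them.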
4. The "Sweet Spot" for Data Allocation
The paper provides a recipe for how to split your data.
- If you want a standard safety level (e.g., 95%): You should split your data roughly 50/50 between training and calibration. Both buckets need to be big enough to do their jobs.
- If you want extreme safety (e.g., 99.9%): You need to be very careful. The paper suggests that if you demand this level of certainty, you might need massive amounts of calibration data. If you don't have it, you shouldn't demand that level of certainty, or your prediction box will be so wide it covers the entire universe.
5. The "Oracle" (The Perfect Box)
The authors compare their method to an "Oracle"—a magical, all-knowing entity that knows the exact answer and draws the smallest possible box that still works.
- Their math proves that as you get more data, their method gets closer and closer to this magical Oracle's box.
- They also figured out exactly how fast it gets there. It turns out the speed depends heavily on that safety margin (α).
Summary: What Should You Do?
If you are building an AI system that needs to be safe (like for self-driving cars or medical diagnosis):
- Don't be greedy with safety: Don't demand 99.99% certainty unless you have a massive amount of data. It will make your predictions too vague to be useful.
- Balance your buckets: Don't dump all your data into "learning" and ignore "testing." You need a healthy amount of data just to measure how uncertain your model is.
- Watch the "Elbow": The paper found a specific point (an "elbow") where asking for a tiny bit more safety causes the prediction box to suddenly get huge. Stay on the safe side of that elbow.
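You can see the elbow directly in a toy simulation: fix the calibration set, shrink α, and watch the box half-width creep up and then jump to infinity. The synthetic normal residuals here are chosen only to show the shape, not to match the paper's experiments:

```python
# Tracing the "elbow": with a fixed calibration set, the half-width
# grows slowly as alpha shrinks, then blows up once alpha drops below
# the 1/(m + 1) limit. Synthetic residuals, for illustration only.
import math
import numpy as np

rng = np.random.default_rng(1)
residuals = np.sort(np.abs(rng.normal(0, 1, 1000)))  # m = 1000 calibration scores
m = len(residuals)

widths = []
for alpha in [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005]:
    k = math.ceil((1 - alpha) * (m + 1))
    width = float(residuals[k - 1]) if k <= m else float("inf")
    widths.append(width)
    print(f"alpha={alpha}: half-width = {width:.2f}")
```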
In a nutshell: This paper gives you a map to stop guessing how much data you need. It tells you that if you want a tight, useful prediction box, you need to balance your training and testing data, and you need to pick a safety level that matches the amount of data you actually have.