Cross-Domain Uncertainty Quantification for Selective Prediction: A Comprehensive Bound Ablation with Transfer-Informed Betting

This paper introduces Transfer-Informed Betting (TIB), a method that combines betting-based confidence sequences with cross-domain transfer learning to achieve tighter, more data-efficient risk guarantees for selective prediction. Across multiple benchmarks and applications, TIB delivers significant coverage improvements over existing bounds.

Abhinaba Basu

Published Wed, 11 Ma

Imagine you have a very smart, but expensive, personal assistant (like a high-end AI). You want to use this assistant to answer questions, but calling it every time costs money and takes time. So, you decide to build a cache: a shortcut where you save the answers to common questions (like "What's the weather?") and just serve those saved answers instead of calling the expensive AI.

The Problem:
What if your shortcut gets it wrong? If you tell your smart home to "turn off the lights" but the shortcut mishears it as "turn on the oven" and executes the wrong command, that's a disaster. You need a way to know when it is safe to use the shortcut and when you should call the expensive AI to double-check.

This paper is about building a safety certificate for that shortcut. It answers the question: "How many times do I need to test my shortcut before I can trust it to work on its own?"

Here is the breakdown of their solution using simple analogies:

1. The "Betting" Strategy (The Core Innovation)

Most old methods for checking safety are like a strict accountant who assumes the worst-case scenario every single time. They say, "I don't know how good your shortcut is, so I'll assume it fails 50% of the time until I have a million test results." That caution keeps you safe, but it also means you can't trust your shortcut for a very long time, because the worst-case math demands a huge amount of data.

The authors introduce a new method called "Transfer-Informed Betting."

  • The Analogy: Imagine you are learning to play poker.
    • Old Way (Cold Start): You sit at a table with no idea what the cards are. You bet very cautiously, losing money slowly as you figure out the rules.
    • New Way (Transfer-Informed): You sit at a table, but before you start, a friend who played at a similar table tells you, "Hey, in this game, the dealer usually deals low cards." You use that tip to start betting smarter immediately.
  • In the Paper: They use data from a "Source Domain" (a big, well-tested dataset of general questions) to give their "shortcut" a head start on a "Target Domain" (a new, smaller dataset of specific questions). It's like giving the shortcut a cheat sheet based on what it learned elsewhere, so it needs far fewer tests to prove it's safe.
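
The betting idea above can be sketched as a test supermartingale: your "wealth" grows every time the cache answers correctly, and a safety certificate is issued once wealth clears 1/α (Ville's inequality caps the chance of a false certificate at α). Everything here — the error stream, the 10% tolerance `p0`, and the rule of using a source-domain estimate to justify a bigger bet — is an illustrative assumption, not the paper's exact TIB construction:

```python
def betting_certificate(errors, p0=0.1, alpha=0.05, lam=0.3):
    """Bet against H0: 'true error rate >= p0'. Each round, wealth is
    multiplied by 1 + lam*(p0 - x); under H0 this process is a
    supermartingale, so by Ville's inequality wealth >= 1/alpha occurs
    with probability <= alpha. Returns the round at which the cache is
    certified safe, or None if the stream runs out first."""
    assert 0 < lam < 1.0 / (1.0 - p0)  # keeps wealth strictly positive
    wealth = 1.0
    for t, x in enumerate(errors, start=1):
        wealth *= 1.0 + lam * (p0 - x)
        if wealth >= 1.0 / alpha:
            return t
    return None

# Deterministic toy stream: one cache error every 50 queries (2% rate).
stream = [1 if i % 50 == 49 else 0 for i in range(2000)]

# Cold start bets cautiously; a source-domain estimate of the error rate
# justifies a larger bet (lam=0.88 here is an illustrative choice).
print(betting_certificate(stream, lam=0.3))   # → 125
print(betting_certificate(stream, lam=0.88))  # → 36
```

The informed bettor certifies the same cache in roughly a third of the samples — that is the "friend's tip at the poker table" in numerical form.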

2. The "Monotone" Test (The LTT Method)

The paper also compares different ways of running the safety tests.

  • The Analogy: Imagine you are trying to find the highest safe speed for a new car.
    • The "Union Bound" (Old Way): You test 100 different speeds (10 mph, 20 mph, ... 100 mph). To be safe, you have to be extremely careful with every single test, which makes your final speed limit very low.
    • The "LTT" Method (New Way): You start at the slowest speed and work your way up. If the car handles 10 mph perfectly, you don't need to be as paranoid about 20 mph. You only spend your "safety budget" once, not 100 times.
  • Result: This allows the system to be much more aggressive (faster/more useful) while staying just as safe.
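
The union-bound vs. LTT comparison can be made concrete with p-values. The setup below is a toy (the error counts, the n = 200 calibration samples, the 5% risk target are all made up, not the paper's data): Bonferroni makes every candidate threshold pass at α/K, while fixed-sequence testing — the idea behind Learn-then-Test — walks from the safest threshold outward at full α and stops at the first failure:

```python
from math import comb

def pvalue(k, n, p0):
    """Exact binomial left-tail p-value for H0: error rate >= p0,
    given k observed errors in n calibration trials."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k + 1))

def union_bound_select(pvals, alpha):
    """Bonferroni: every one of the K thresholds must pass at alpha/K."""
    K = len(pvals)
    return [i for i, p in enumerate(pvals) if p <= alpha / K]

def fixed_sequence_select(pvals, alpha):
    """Fixed-sequence (LTT-style): test most-conservative-first at full
    alpha, stop at the first failure."""
    certified = []
    for i, p in enumerate(pvals):
        if p > alpha:
            break
        certified.append(i)
    return certified

# Hypothetical calibration results: for 10 candidate confidence
# thresholds (most conservative first), cache errors in n=200 trials.
errors = [0, 0, 1, 1, 2, 3, 5, 8, 13, 21]
pvals = [pvalue(k, 200, p0=0.05) for k in errors]

print(len(union_bound_select(pvals, 0.05)))      # → 5 thresholds certified
print(len(fixed_sequence_select(pvals, 0.05)))   # → 6 thresholds certified
```

Same data, same overall safety level, but the sequential test certifies one extra (more aggressive) threshold — exactly the "spend your safety budget once" effect.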

3. The "Coverage" vs. "Safety" Trade-off

The paper measures how many questions the shortcut can answer safely.

  • The Result: On a standard dataset, the old methods said, "You can only answer 74% of questions safely." The new methods said, "You can answer 94% safely!"
  • Why it matters: That extra 20% means your AI assistant saves a lot more money and time because it doesn't have to call the expensive "supervisor" AI as often.
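
The arithmetic behind "that extra 20%": moving coverage from 74% to 94% shrinks the escalation rate from 26% to 6% of queries, better than a 4x cut in expensive-model calls. The per-call price and query volume below are made-up numbers purely for illustration:

```python
def escalation_cost(coverage, n_queries, cost_per_call):
    """Cost of the queries the cache cannot certify, which must still
    be escalated to the expensive model."""
    return (1.0 - coverage) * n_queries * cost_per_call

# Hypothetical pricing: $0.01 per expensive-model call, 1M queries.
old = escalation_cost(0.74, 1_000_000, 0.01)
new = escalation_cost(0.94, 1_000_000, 0.01)
print(round(old, 2), round(new, 2))  # → 2600.0 600.0
```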

4. The "Progressive Trust" Model

This is the most practical part for real life. The paper suggests we shouldn't just flip a switch from "Unsafe" to "Safe." Instead, we should have Levels of Trust:

  • Level 0 (No Data): The shortcut is useless. Every question goes to the expensive AI.
  • Level 1 (Some Data): The shortcut is "Semi-Autonomous." It can handle easy questions, but if it's unsure, it asks the AI.
  • Level 2 (Lots of Data): The shortcut is "Fully Autonomous." It handles almost everything on its own.

The math in the paper proves exactly how much data you need to move from Level 0 to Level 1, and from Level 1 to Level 2. It turns "trust" from a vague feeling into a hard number.
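
To see how "trust" becomes a hard number, here is the textbook Hoeffding sample-complexity calculation — a generic worst-case baseline, not the paper's transfer-tightened bound, and the risk thresholds and cushions per level are illustrative assumptions:

```python
from math import ceil, log

def samples_needed(margin, delta):
    """One-sided Hoeffding bound: with n >= ln(1/delta) / (2*margin^2)
    clean test samples, the empirical error rate is within `margin` of
    the true rate with probability at least 1 - delta."""
    return ceil(log(1.0 / delta) / (2.0 * margin ** 2))

# Illustrative trust ladder (thresholds are assumptions, not the paper's):
# Level 1: certify risk < 10% with a 5% cushion on the empirical rate.
# Level 2: certify risk < 2% with a 1% cushion.
n_level1 = samples_needed(margin=0.05, delta=0.05)
n_level2 = samples_needed(margin=0.01, delta=0.05)
print(n_level1, n_level2)  # → 600 14979
```

The quadratic dependence on the margin is why the jump from "semi-autonomous" to "fully autonomous" costs so much more data — and why the paper's tighter, transfer-informed bounds matter.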

5. Why Not Just Use "Prediction Sets"?

The paper also explains why they didn't use a popular alternative method called "Conformal Prediction."

  • The Analogy:
    • Conformal Prediction: When you ask "What's the weather?", it says, "It's either Sunny, Cloudy, or Rainy." (It gives you a list of 3 possibilities).
    • Selective Prediction (This Paper): When you ask "What's the weather?", it says, "It's Sunny," and gives you a guarantee that it's 95% sure.
  • Why it matters: If you are controlling a robot or a smart home, you can't say "Maybe turn on the lights, maybe turn on the AC." You need a single, definite answer. This paper provides the math to give you that single, safe answer.
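
The difference from conformal prediction is the output type: one label plus an abstain option, instead of a set of plausible labels. A minimal sketch, where the 0.9 threshold stands in for whatever value the calibration procedure actually certifies:

```python
def selective_predict(label, confidence, threshold):
    """Return a single definite answer if confidence clears the
    certified threshold; otherwise abstain and escalate."""
    return label if confidence >= threshold else "ESCALATE"

print(selective_predict("lights_off", 0.97, 0.9))  # → lights_off
print(selective_predict("lights_off", 0.62, 0.9))  # → ESCALATE
```

All of the machinery above — betting, LTT, transfer — exists to justify that one threshold with a hard statistical guarantee.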

Summary

This paper is a rulebook for building safe, cheap AI shortcuts.

  1. It uses betting strategies to learn faster.
  2. It uses transfer learning (borrowing knowledge from similar tasks) to get a head start instead of starting cold.
  3. It provides a mathematical guarantee that tells you exactly when your shortcut is safe enough to run on its own without human (or expensive AI) supervision.

In short: It helps you build a smarter, cheaper, and safer AI assistant that knows exactly when it's confident enough to do the job alone.