Imagine you are trying to predict the weather for next week. Instead of asking just one meteorologist, you ask ten of them. Each one gives you a forecast, but they also give you a "confidence score" (e.g., "I'm 90% sure it will rain" vs. "I'm only 50% sure").
In the world of Artificial Intelligence, we often do the same thing: we use multiple AI models to make predictions. But here's the problem: How do you combine their confidence scores to give you a single, reliable answer that isn't too vague?
If you just take the average, you might get a prediction that is too wide (e.g., "It will rain somewhere between 1 PM and 11 PM"), which isn't very helpful. If you take the most confident model, you might be wrong if that model is overconfident.
This paper introduces a new method called SACP (Symmetric Aggregated Conformal Prediction) to solve this puzzle. Here is how it works, explained simply:
1. The Problem: The "Confidence" Mismatch
Imagine your ten meteorologists are speaking different languages.
- Meteorologist A says, "My confidence is a 9."
- Meteorologist B says, "My confidence is a 0.9."
- Meteorologist C says, "My confidence is a -5."
Even though they might mean similar things, you can't just add them up or average them because their scales are different. In AI, this is called having different "scales" or "distributions." Traditional methods often struggle to mix these scores fairly without losing information or making the final prediction too wide.
2. The Solution: The "Universal Translator" (E-Values)
The authors' first big idea is to translate all these different confidence scores into a common language.
They use a mathematical trick to turn every model's score into something called an e-value: a non-negative number that, on average, stays at or below 1 when nothing unusual is going on. Think of an e-value as a standardized currency.
- Before: Meteorologist A has 9 dollars, B has 0.9 euros, C has -5 yen.
- After SACP: Everyone is converted to "Confidence Coins," where the average value is guaranteed to stay at or below 1.
Now, no matter which model you ask, their confidence is measured on the exact same scale. This allows you to compare them fairly.
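The translation step can be sketched in a few lines. This is a minimal illustration, not the paper's exact construction: the `to_e_values` helper is hypothetical, and it simply passes each model's raw scores through a positive transform and divides by the calibration-set average, so every model's standardized scores average exactly 1 on its own calibration data.

```python
import numpy as np

def to_e_values(raw_scores, calibration_scores):
    """Standardize one model's raw confidence scores (hypothetical sketch).

    Exponentiate so everything is positive (even scores like -5), then
    divide by the calibration average, so the standardized scores
    average exactly 1 on the calibration set.
    """
    positive = np.exp(np.asarray(raw_scores, dtype=float))
    baseline = np.exp(np.asarray(calibration_scores, dtype=float)).mean()
    return positive / baseline

# Three "meteorologists" on wildly different scales.
model_a = to_e_values([9.0, 8.5], calibration_scores=[9.0, 8.0, 10.0])
model_b = to_e_values([0.9, 0.8], calibration_scores=[0.9, 0.7, 1.1])
model_c = to_e_values([-5.0, -6.0], calibration_scores=[-5.0, -4.0, -6.0])

# All three now live on a common scale centered near 1.
print(model_a, model_b, model_c)
```

Whatever the original units, the output of each model is now directly comparable to the others.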
3. The Aggregation: The "Symmetric Team Huddle"
Once everyone is speaking the same language (using e-values), SACP asks them to huddle up and agree on a final prediction.
The key word here is "Symmetric." Imagine a round table where every meteorologist sits in a circle. It doesn't matter who sits where; the group's decision depends only on what they say, not who says it.
- If you swap Meteorologist A and B, the final result is exactly the same.
- This ensures that no single model is accidentally favored just because of how the computer listed them.
The method then uses a flexible "aggregation function" (a mathematical rule) to combine these standardized scores. You can choose a rule that is very strict (only accepting if everyone agrees) or more lenient (accepting if most agree), depending on how much risk you are willing to take.
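The "round table" property is just permutation invariance: the rule sees the collection of scores, not their order. A sketch with a few common symmetric rules (the rule names here are illustrative, not taken from the paper):

```python
import numpy as np

def aggregate(e_values, rule="mean"):
    """Combine the models' standardized e-values with a symmetric rule.

    Each rule depends only on the multiset of inputs, so reordering
    the models (swapping seats at the round table) cannot change it.
    """
    e = np.asarray(e_values, dtype=float)
    if rule == "mean":       # middle ground: the average vote
        return e.mean()
    if rule == "product":    # geometric mean: compounds agreement
        return np.prod(e) ** (1.0 / len(e))
    if rule == "min":        # conservative: strong only if every model is strong
        return e.min()
    raise ValueError(f"unknown rule: {rule}")

scores = [2.0, 0.5, 1.3]
shuffled = [1.3, 2.0, 0.5]
# Swapping who sits where never changes the huddle's answer.
assert aggregate(scores, "min") == aggregate(shuffled, "min")
assert np.isclose(aggregate(scores, "mean"), aggregate(shuffled, "mean"))
```

Choosing among such rules is exactly the strict-versus-lenient dial described above: `min` only reports strong evidence when every model shows it, while `mean` lets a few confident models carry the group.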
4. The Result: Sharper, Smarter Predictions
The goal of this whole process is Efficiency. In AI, a "prediction set" is the list of candidate answers the system refuses to rule out, like a target you draw on a board.
- Inefficient: Drawing a giant circle that covers the whole board. You are definitely right (100% coverage), but you didn't really tell us anything useful.
- Efficient: Drawing a tiny bullseye. If you are right, it's very useful.
SACP manages to draw smaller, tighter bullseyes than previous methods while still guaranteeing that the true answer lands inside the circle at the promised rate (say, at least 90% of the time). It does this by:
- Standardizing the inputs so they can be compared fairly.
- Symmetrically combining them so no bias is introduced.
- Adapting the combination rule to find the "sweet spot" between being safe and being precise.
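The three steps above can be sketched end to end. This toy uses assumed mechanics, not the paper's algorithm: a candidate answer stays in the prediction set unless the symmetrically combined e-value against it reaches the evidence threshold 1/alpha (by Markov's inequality, that threshold wrongly excludes the true answer with probability at most alpha).

```python
import numpy as np

def prediction_set(candidates, e_values_per_model, alpha=0.1):
    """Keep a candidate unless the combined evidence against it is strong.

    e_values_per_model[m][y] is model m's e-value against candidate y.
    Rejecting when the averaged e-value reaches 1/alpha wrongly excludes
    the true answer with probability at most alpha (Markov's inequality).
    """
    e = np.asarray(e_values_per_model, dtype=float)  # shape (models, candidates)
    combined = e.mean(axis=0)                        # symmetric: average over models
    return [y for y, ev in zip(candidates, combined) if ev < 1.0 / alpha]

candidates = ["rain", "sun", "snow", "fog"]
e_values = [
    [0.2, 15.0, 30.0, 4.0],   # model A's evidence against each candidate
    [0.5, 12.0, 25.0, 2.0],   # model B
    [0.3, 18.0, 20.0, 3.0],   # model C
]
print(prediction_set(candidates, e_values, alpha=0.1))  # → ['rain', 'fog']
```

Here the committee's pooled evidence rules out "sun" and "snow", leaving a two-item bullseye instead of the whole board.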
The "SACP++" Upgrade
The paper also introduces SACP++, which is like the "Pro" version.
- SACP uses a standard rule to combine the scores (like a simple average).
- SACP++ looks at the data and asks, "Hey, which specific rule would have given us the smallest prediction set for this specific problem?" It automatically picks the best rule to make the prediction as tight as possible without breaking the safety guarantees.
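The "pick the best rule" idea can be sketched as a small search over held-out data. The helper names below are hypothetical, and the real method must be more careful than this sketch to preserve the coverage guarantee:

```python
import numpy as np

def pick_rule(rules, held_out_cases, alpha=0.1):
    """Choose the aggregation rule giving the smallest average set size.

    Each held-out case is (candidates, e_matrix), where e_matrix has one
    row of e-values per model. `rules` maps a name to a symmetric
    combiner acting over the model axis.
    """
    def set_size(combine, candidates, e_matrix):
        combined = combine(np.asarray(e_matrix, dtype=float))
        return sum(ev < 1.0 / alpha for ev in combined)  # candidates kept

    avg_sizes = {
        name: np.mean([set_size(fn, c, e) for c, e in held_out_cases])
        for name, fn in rules.items()
    }
    return min(avg_sizes, key=avg_sizes.get), avg_sizes

rules = {
    "mean": lambda e: e.mean(axis=0),
    "max":  lambda e: e.max(axis=0),   # strictest: one model's evidence suffices
}
cases = [
    (["rain", "sun"], [[0.4, 12.0], [0.6, 6.0]]),
    (["rain", "sun"], [[0.5, 9.0],  [0.7, 9.0]]),
]
best, sizes = pick_rule(rules, cases, alpha=0.1)
print(best, sizes)  # → max {'mean': 2.0, 'max': 1.5}
```

On this toy data the `max` rule yields smaller sets, so the search would pick it; the actual SACP++ selection additionally has to ensure that the data-driven choice does not break the validity of the final prediction set.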
Why This Matters
In high-stakes situations, like diagnosing a disease, predicting stock market crashes, or steering a self-driving car, you need to know not just what will happen, but how sure the AI is.
This paper gives us a better way to listen to a "committee" of AIs. Instead of getting a muddy, vague answer, SACP helps us get a clear, precise, and trustworthy answer, ensuring we don't miss the target while keeping the safety net strong.
In short: SACP is a translator and a team leader that helps multiple AI models work together to give you a sharper, more accurate prediction without losing the guarantee that they are right.