Are foundation models for computer vision good conformal predictors?

This paper demonstrates that vision and vision-language foundation models are well-suited for Conformal Prediction, revealing that while few-shot adaptation improves conformal scores and APS ensures robust coverage, calibrating model confidence can paradoxically degrade the efficiency of adaptive conformal sets.

Leo Fillioux, Julio Silva-Rodríguez, Ismail Ben Ayed, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz

Published 2026-02-17

Imagine you have a super-smart robot chef (a Foundation Model) that has tasted millions of dishes and can identify ingredients with incredible accuracy. You want to hire this chef to work in a high-stakes kitchen, like a hospital cafeteria or a space station, where a mistake could be dangerous.

Before you hire them, you ask: "How sure are you about your answers?"

This paper is like a rigorous safety inspection for these robot chefs. It tests them using a special tool called Conformal Prediction (CP).

The Core Concept: The "Safety Net" Basket

Usually, when a model makes a guess, it says, "I'm 90% sure this is a cat." But what if it's wrong?

Conformal Prediction changes the game. Instead of giving a single guess, it gives you a basket of possibilities.

  • The Rule: "I promise that 95% of the time, the real answer will be inside this basket."
  • The Trade-off: If the model is very unsure, the basket gets bigger (e.g., "It's a cat, a dog, or a fox"). If it's very sure, the basket is tiny (e.g., "It's definitely a cat").

The goal is to keep the basket as small as possible (for efficiency) while never breaking the promise that the real answer is inside (for safety).
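To make the basket-building concrete, here is a minimal sketch of split conformal prediction using the simple "1 minus the probability of the true class" score. The 10-class setup, the random probabilities, and the 95% target are illustrative assumptions, not the paper's exact experimental setup; any classifier's softmax outputs could be plugged in.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.05):
    """Compute the score threshold on a held-out calibration set.

    cal_probs:  (n, K) softmax probabilities from any classifier
    cal_labels: (n,)   true class indices
    alpha:      miscoverage rate (0.05 -> a ~95% coverage promise)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]

def prediction_set(probs, threshold):
    """Return the 'basket' of classes for one test example."""
    return np.where(1.0 - probs <= threshold)[0]

# Toy usage with random numbers standing in for a real model's outputs.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)
cal_labels = rng.integers(0, 10, size=500)
tau = conformal_threshold(cal_probs, cal_labels, alpha=0.05)
print(prediction_set(rng.dirichlet(np.ones(10)), tau))
```

The quantile computed on the calibration set is what turns the model's raw confidences into a basket that keeps the coverage promise.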

The Big Question

The authors asked: "Are these new, super-powerful Foundation Models (like DINOv2 or CLIP) actually good at using this safety net?"

They tested 17 different "chefs" (models) across various scenarios. Here is what they found, explained with analogies:

1. The "Vision Transformer" Chefs are the Best

  • The Finding: Models built with "Vision Transformers" (like DINO and CLIP) are much better at creating tight, efficient safety baskets than older models based on Convolutional Neural Networks (CNNs).
  • The Analogy: Think of the older models as old-school detectives who rely on looking at individual clues (edges, textures) one by one. They get confused easily. The new Transformer models are like modern detectives who look at the whole picture at once. They understand the context better, so they know exactly how many suspects to put in the "safety basket" without making it too big.

2. The "Confidence Calibration" Trap

  • The Finding: People often try to "calibrate" these models to make their confidence scores more honest (e.g., if they say 80%, they should be right 80% of the time). The paper found that calibrating these models actually makes the adaptive safety baskets bigger and less efficient.
  • The Analogy: Imagine a weather forecaster who is usually very confident. You ask them to be more "honest" about their uncertainty. They start saying, "Well, it might rain, or it might not, or it might snow..." just to be safe.
    • The Result: Their "safety basket" of weather possibilities becomes huge. While they are technically more "honest" (calibrated), they become less useful because the basket is too big to act on. The paper suggests that for these specific models, being slightly overconfident is actually better for keeping the safety basket small (the sketch after this list shows what this "softening" of confidence looks like in code).
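As a rough illustration of why calibration can backfire here, the sketch below shows temperature scaling, a common calibration step: dividing the logits by a temperature above 1 flattens the softmax, so an adaptive method has to accumulate more classes before it reaches its cutoff. The logits and temperature values are made up for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with optional temperature scaling (T > 1 softens confidence)."""
    z = logits / temperature
    z = z - z.max()                 # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits for a 5-class problem.
logits = np.array([4.0, 1.5, 1.0, 0.5, 0.0])

print(softmax(logits, temperature=1.0))  # sharper: most mass on class 0
print(softmax(logits, temperature=3.0))  # softer: mass spread over more classes
```

The second printout spreads probability across more classes, and that flattening is exactly what makes adaptive prediction sets grow.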

3. The "Chameleon" Effect (Domain Shifts)

  • The Finding: When the data changes (e.g., the chef is used to cooking in a sunny kitchen but suddenly has to cook in a dark, rainy one), the safety net usually breaks. However, one method called APS (Adaptive Prediction Sets) was incredibly robust. It kept its promise even when the environment changed, though the baskets did get a bit bigger.
  • The Analogy: Most safety nets are like glass: they work great in a controlled room but shatter when you take them outside. The APS method is like a bungee cord. When the environment gets weird (domain shift), the cord stretches (the basket gets bigger) to catch the falling object, ensuring the safety promise is kept, even if it's less precise. (A minimal sketch of how APS stretches its basket follows this list.)
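For readers curious how the "bungee cord" stretches mechanically, here is a minimal sketch of the APS-style set construction: classes are added in decreasing order of probability until their cumulative probability reaches a calibrated cutoff. The cutoff below is a placeholder; in the real procedure it is derived from APS scores on a held-out calibration set, much like the quantile in the earlier sketch.

```python
import numpy as np

def aps_prediction_set(probs, cutoff):
    """Greedily add classes in decreasing-probability order until the
    cumulative probability reaches the calibrated cutoff."""
    order = np.argsort(probs)[::-1]          # most likely class first
    cumulative = np.cumsum(probs[order])
    # Keep every class up to (and including) the one that crosses the cutoff.
    k = int(np.searchsorted(cumulative, cutoff)) + 1
    return order[:k]

# Hypothetical softmax output for a 5-class problem and a placeholder cutoff.
probs = np.array([0.55, 0.25, 0.10, 0.06, 0.04])
print(aps_prediction_set(probs, cutoff=0.9))   # -> classes [0, 1, 2]
```

When the model's probabilities flatten out (as under domain shift), the cumulative sum climbs more slowly, so the set automatically stretches to include more classes rather than breaking the coverage promise.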

4. Learning a Few New Tricks (Few-Shot Learning)

  • The Finding: When you teach these models a few new things with just a handful of examples (few-shot), they actually get better at using the safety net compared to when they try to guess without any help (zero-shot).
  • The Analogy: If you ask a chef to guess a secret recipe with no hints (Zero-Shot), they might throw a huge basket of ingredients at you just in case. But if you give them three hints (Few-Shot), they narrow down the possibilities immediately. The safety basket becomes much smaller and more useful. (A small sketch of one way to use those few hints follows this list.)
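To tie this back to the conformal pipeline, here is a hedged sketch of one very simple few-shot adapter: a prototype (nearest-class-mean) classifier built on frozen foundation-model features, whose softmax-style outputs could then feed the conformal procedures sketched above. The feature dimension, the random features, and the prototype classifier itself are illustrative assumptions; the paper evaluates its own set of adaptation methods.

```python
import numpy as np

def prototype_probs(support_feats, support_labels, query_feat, num_classes):
    """Class probabilities from a nearest-class-mean (prototype) classifier
    over frozen features, a simple stand-in for few-shot adaptation."""
    prototypes = np.stack([
        support_feats[support_labels == c].mean(axis=0) for c in range(num_classes)
    ])
    # Similarity = negative squared distance to each class prototype.
    logits = -((prototypes - query_feat) ** 2).sum(axis=1)
    logits -= logits.max()          # numerical stability
    exp_l = np.exp(logits)
    return exp_l / exp_l.sum()

# Toy example: 4 classes, 4 shots each, 16-dimensional "frozen" features.
rng = np.random.default_rng(0)
support_feats = rng.normal(size=(16, 16))
support_labels = np.repeat(np.arange(4), 4)
query_feat = rng.normal(size=16)
print(prototype_probs(support_feats, support_labels, query_feat, num_classes=4))
```

Sharper probabilities from even a handful of labeled examples mean the conformal threshold admits fewer classes, which is the smaller, more useful basket the finding describes.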

The Final Verdict

The paper concludes that Foundation Models are excellent candidates for safety-critical applications, provided you use the right tools.

  • Best Tool: Use APS (Adaptive Prediction Sets). It's the most reliable "safety net" that won't break when things get weird, even if it sometimes makes the basket a little larger.
  • Best Chef: Use models based on Vision Transformers (like DINO or CLIP). They naturally understand the world better than older models.
  • Avoid: Don't bother "calibrating" them to be perfectly honest about their confidence; it just makes the safety baskets too big to be useful.

In short: These new AI models are smart enough to know when they are unsure, but you have to give them the right kind of "safety net" (APS) to catch them when they stumble.
