Are foundation models for computer vision good conformal predictors?

This paper demonstrates that vision and vision-language foundation models are well-suited for Conformal Prediction, revealing that while few-shot adaptation improves conformal scores and APS ensures robust coverage, calibrating model confidence can paradoxically degrade the efficiency of adaptive conformal sets.

Leo Fillioux, Julio Silva-Rodríguez, Ismail Ben Ayed, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz

Published 2026-02-17

Imagine you have a super-smart robot chef (a Foundation Model) that has tasted millions of dishes and can identify ingredients with incredible accuracy. You want to hire this chef to work in a high-stakes kitchen, like a hospital cafeteria or a space station, where a mistake could be dangerous.

Before you hire them, you ask: "How sure are you about your answers?"

This paper is like a rigorous safety inspection for these robot chefs. It tests them using a special tool called Conformal Prediction (CP).

The Core Concept: The "Safety Net" Basket

Usually, when a model makes a guess, it says, "I'm 90% sure this is a cat." But what if it's wrong?

Conformal Prediction changes the game. Instead of giving a single guess, it gives you a basket of possibilities.

  • The Rule: "I promise that 95% of the time, the real answer will be inside this basket."
  • The Trade-off: If the model is very unsure, the basket gets bigger (e.g., "It's a cat, a dog, or a fox"). If it's very sure, the basket is tiny (e.g., "It's definitely a cat").

The goal is to keep the basket as small as possible (for efficiency) while never breaking the promise that the real answer is inside (for safety).
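To make the basket-building concrete, here is a minimal sketch of split conformal prediction using the simple "1 minus the probability of the true class" score. The 10-class setup, the random probabilities, and the 95% target are illustrative assumptions, not the paper's exact experimental setup; any classifier's softmax outputs could be plugged in.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.05):
    """Compute the score threshold on a held-out calibration set.

    cal_probs:  (n, K) softmax probabilities from any classifier
    cal_labels: (n,)   true class indices
    alpha:      miscoverage rate (0.05 -> a ~95% coverage promise)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]

def prediction_set(probs, threshold):
    """Return the 'basket' of classes for one test example."""
    return np.where(1.0 - probs <= threshold)[0]

# Toy usage with random numbers standing in for a real model's outputs.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)
cal_labels = rng.integers(0, 10, size=500)
tau = conformal_threshold(cal_probs, cal_labels, alpha=0.05)
print(prediction_set(rng.dirichlet(np.ones(10)), tau))
```

The quantile computed on the calibration set is what turns the model's raw confidences into a basket that keeps the coverage promise.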

The Big Question

The authors asked: "Are these new, super-powerful Foundation Models (like DINOv2 or CLIP) actually good at using this safety net?"

They tested 17 different "chefs" (models) across various scenarios. Here is what they found, explained with analogies:

1. The "Vision Transformer" Chefs are the Best

  • The Finding: Models built with "Vision Transformers" (like DINO and CLIP) are much better at creating tight, efficient safety baskets than older models based on Convolutional Neural Networks (CNNs).
  • The Analogy: Think of the older models as old-school detectives who rely on looking at individual clues (edges, textures) one by one. They get confused easily. The new Transformer models are like modern detectives who look at the whole picture at once. They understand the context better, so they know exactly how many suspects to put in the "safety basket" without making it too big.

2. The "Confidence Calibration" Trap

  • The Finding: People often try to "calibrate" these models to make their confidence scores more honest (e.g., if they say 80%, they should be right 80% of the time). The paper found that calibrating these models actually makes the adaptive safety baskets bigger and less efficient.
  • The Analogy: Imagine a weather forecaster who is usually very confident. You ask them to be more "honest" about their uncertainty. They start saying, "Well, it might rain, or it might not, or it might snow..." just to be safe.
    • The Result: Their "safety basket" of weather possibilities becomes huge. While they are technically more "honest" (calibrated), they become less useful because the basket is too big to act on. The paper suggests that for these specific models, being slightly overconfident is actually better for keeping the safety basket small (the sketch after this list shows what this "softening" of confidence looks like in code).
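As a rough illustration of why calibration can backfire here, the sketch below shows temperature scaling, a common calibration step: dividing the logits by a temperature above 1 flattens the softmax, so an adaptive method has to accumulate more classes before it reaches its cutoff. The logits and temperature values are made up for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with optional temperature scaling (T > 1 softens confidence)."""
    z = logits / temperature
    z = z - z.max()                 # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits for a 5-class problem.
logits = np.array([4.0, 1.5, 1.0, 0.5, 0.0])

print(softmax(logits, temperature=1.0))  # sharper: most mass on class 0
print(softmax(logits, temperature=3.0))  # softer: mass spread over more classes
```

The second printout spreads probability across more classes, and that flattening is exactly what makes adaptive prediction sets grow.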

3. The "Chameleon" Effect (Domain Shifts)

  • The Finding: When the data changes (e.g., the chef is used to cooking in a sunny kitchen but suddenly has to cook in a dark, rainy one), the safety net usually breaks. However, one method called APS (Adaptive Prediction Sets) was incredibly robust. It kept its promise even when the environment changed, though the baskets did get a bit bigger.
  • The Analogy: Most safety nets are like glass: they work great in a controlled room but shatter when you take them outside. The APS method is like a bungee cord. When the environment gets weird (domain shift), the cord stretches (the basket gets bigger) to catch the falling object, ensuring the safety promise is kept, even if it's less precise. (A minimal sketch of how APS stretches its basket follows this list.)
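For readers curious how the "bungee cord" stretches mechanically, here is a minimal sketch of the APS-style set construction: classes are added in decreasing order of probability until their cumulative probability reaches a calibrated cutoff. The cutoff below is a placeholder; in the real procedure it is derived from APS scores on a held-out calibration set, much like the quantile in the earlier sketch.

```python
import numpy as np

def aps_prediction_set(probs, cutoff):
    """Greedily add classes in decreasing-probability order until the
    cumulative probability reaches the calibrated cutoff."""
    order = np.argsort(probs)[::-1]          # most likely class first
    cumulative = np.cumsum(probs[order])
    # Keep every class up to (and including) the one that crosses the cutoff.
    k = int(np.searchsorted(cumulative, cutoff)) + 1
    return order[:k]

# Hypothetical softmax output for a 5-class problem and a placeholder cutoff.
probs = np.array([0.55, 0.25, 0.10, 0.06, 0.04])
print(aps_prediction_set(probs, cutoff=0.9))   # -> classes [0, 1, 2]
```

When the model's probabilities flatten out (as under domain shift), the cumulative sum climbs more slowly, so the set automatically stretches to include more classes rather than breaking the coverage promise.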

4. Learning a Few New Tricks (Few-Shot Learning)

  • The Finding: When you teach these models a few new things with just a handful of examples (few-shot), they actually get better at using the safety net compared to when they try to guess without any help (zero-shot).
  • The Analogy: If you ask a chef to guess a secret recipe with no hints (Zero-Shot), they might throw a huge basket of ingredients at you just in case. But if you give them three hints (Few-Shot), they narrow down the possibilities immediately. The safety basket becomes much smaller and more useful. (A small sketch of one way to use those few hints follows this list.)
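To tie this back to the conformal pipeline, here is a hedged sketch of one very simple few-shot adapter: a prototype (nearest-class-mean) classifier built on frozen foundation-model features, whose softmax-style outputs could then feed the conformal procedures sketched above. The feature dimension, the random features, and the prototype classifier itself are illustrative assumptions; the paper evaluates its own set of adaptation methods.

```python
import numpy as np

def prototype_probs(support_feats, support_labels, query_feat, num_classes):
    """Class probabilities from a nearest-class-mean (prototype) classifier
    over frozen features, a simple stand-in for few-shot adaptation."""
    prototypes = np.stack([
        support_feats[support_labels == c].mean(axis=0) for c in range(num_classes)
    ])
    # Similarity = negative squared distance to each class prototype.
    logits = -((prototypes - query_feat) ** 2).sum(axis=1)
    logits -= logits.max()          # numerical stability
    exp_l = np.exp(logits)
    return exp_l / exp_l.sum()

# Toy example: 4 classes, 4 shots each, 16-dimensional "frozen" features.
rng = np.random.default_rng(0)
support_feats = rng.normal(size=(16, 16))
support_labels = np.repeat(np.arange(4), 4)
query_feat = rng.normal(size=16)
print(prototype_probs(support_feats, support_labels, query_feat, num_classes=4))
```

Sharper probabilities from even a handful of labeled examples mean the conformal threshold admits fewer classes, which is the smaller, more useful basket the finding describes.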

The Final Verdict

The paper concludes that Foundation Models are excellent candidates for safety-critical applications, provided you use the right tools.

  • Best Tool: Use APS (Adaptive Prediction Sets). It's the most reliable "safety net" that won't break when things get weird, even if it sometimes makes the basket a little larger.
  • Best Chef: Use models based on Vision Transformers (like DINO or CLIP). They naturally understand the world better than older models.
  • Avoid: Don't bother "calibrating" them to be perfectly honest about their confidence; it just makes the safety baskets too big to be useful.

In short: These new AI models are smart enough to know when they are unsure, but you have to give them the right kind of "safety net" (APS) to catch them when they stumble.
