Imagine you are a weather forecaster. Your job is to predict if it will rain tomorrow.
Sometimes, the sky is clear, the clouds are fluffy, and you are 100% sure: "It's going to be sunny!" You make a prediction.
Other times, the sky is a chaotic mess of dark clouds, strange wind patterns, and fog. You look at your data, and you feel shaky. You aren't sure if it's a storm or just a weird cloud formation.
The Problem:
Most AI models are like terrible weather forecasters who never admit they are unsure. Even when the sky is a chaotic mess, they will confidently shout, "It's going to be sunny!" and get it wrong. This is dangerous. In medicine, finance, or self-driving cars, a confident wrong answer is often worse than no answer at all.
The Solution: "Knowing When to Abstain"
This paper introduces a way to teach AI models to say, "I don't know, please ask a human expert." This is called Selective Classification. The model gets to choose: either make a prediction (Accept) or stay silent (Abstain).
The goal is simple: Only make predictions when you are sure, and stay quiet when you are confused.
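Mechanically, the accept/abstain choice is just a score plus a threshold. Here is a minimal sketch in Python; the function name and the 0.9 threshold are hypothetical placeholders you would tune on held-out data, not values from the paper:

```python
def selective_predict(predicted_class, score, threshold=0.9):
    """Accept or abstain based on a per-input confidence score.

    predicted_class: the model's best guess for this input
    score: any per-input confidence score (several are sketched below)
    threshold: hypothetical cutoff, tuned on validation data
    Returns the prediction, or None to signal "ask a human expert."
    """
    if score >= threshold:
        return predicted_class  # Accept: confident enough to answer
    return None                 # Abstain: stay silent, defer to a human
```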
The Old Way vs. The New Way
The Old Way (Heuristics):
Previously, scientists tried to figure out when a model was unsure by looking at "confidence scores."
- Analogy: Imagine checking a thermometer. If the temperature is high, the model is "hot" (confident). If it's low, it's "cold" (uncertain).
- The Flaw: This is like checking a thermometer in a blizzard. The thermometer might say "Hot" just because the wind is blowing, not because it's actually sunny. These old methods often fail when the data changes (e.g., moving from natural photos to sketches or corrupted images).
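For concreteness, the classic "thermometer" is usually the maximum softmax probability (MSP). A minimal NumPy sketch of that heuristic (the function name is ours):

```python
import numpy as np

def max_softmax_probability(logits):
    """The classic confidence heuristic: the largest softmax probability.

    logits: raw model outputs for one input, shape (num_classes,)
    The catch: under distribution shift, a model can emit a high MSP
    on inputs unlike anything it was trained on -- the blizzard
    thermometer reading "Hot."
    """
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax
    return float(probs.max())
```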
The New Way (The "Likelihood Ratio" Lens):
The authors of this paper looked at a classic rule from statistics called the Neyman-Pearson Lemma.
- The Analogy: Imagine you are a detective trying to solve a crime. You have two suspects: Mr. Correct and Mr. Wrong.
- You look at the evidence (the input data).
- You ask: "How much more likely is this evidence to have come from Mr. Correct than from Mr. Wrong?"
- If the evidence looks much more like Mr. Correct, you make a prediction.
- If the evidence looks like a toss-up between the two, you abstain (say "I don't know").
The paper argues that, according to this lemma, the optimal way to decide is to calculate a Likelihood Ratio:
Score = (How likely is this a correct prediction?) / (How likely is this a wrong prediction?)
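As a toy illustration of that ratio, here is one way to estimate it: fit a simple density model to the features of past correct predictions and past wrong predictions, then compare. The Gaussian choice here is purely illustrative; the paper's actual estimators are more sophisticated.

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_ratio_score(z, feats_correct, feats_wrong):
    """Neyman-Pearson-style score: p(evidence | correct) / p(evidence | wrong).

    z: feature embedding of a new input, shape (d,)
    feats_correct / feats_wrong: embeddings of held-out examples the
    model classified correctly / incorrectly, shape (n, d).
    We fit one Gaussian to each group (a toy choice for illustration)
    and compare the two densities at z.
    """
    p_c = multivariate_normal.pdf(
        z, mean=feats_correct.mean(axis=0),
        cov=np.cov(feats_correct.T), allow_singular=True)
    p_w = multivariate_normal.pdf(
        z, mean=feats_wrong.mean(axis=0),
        cov=np.cov(feats_wrong.T), allow_singular=True)
    return p_c / (p_w + 1e-12)  # >> 1: evidence points to "Mr. Correct"
```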
The Two New Tools
The authors realized that calculating this perfect ratio is hard, so they built two new "detective tools" to approximate it:
The "Correct vs. Wrong" Map (Distance-Based):
- The Metaphor: Imagine a map of a city. The "Correct" neighborhood is full of happy, well-dressed people. The "Wrong" neighborhood is full of confused people.
- When a new person walks in, the old tools just asked, "Are they close to the city center?"
- The New Tool asks: "Are they closer to the Correct neighborhood or the Wrong neighborhood?"
- They created two versions (a code sketch follows this list):
  - MDS: Uses a "straight-line" map (good for standard, supervised models).
  - KNN: Uses a "neighborhood" map (good for complex models like Vision-Language models). It looks at the k closest neighbors to see if they are mostly "Correct" or "Wrong."
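Here is a minimal sketch of the KNN flavour of this idea (our own simplified reading, not necessarily the paper's exact scoring rule): compare how far the new input is from its k-th nearest "Correct" neighbor versus its k-th nearest "Wrong" neighbor. The MDS flavour would replace the neighbor lookups with "straight-line" distances to the two groups as a whole.

```python
import numpy as np

def knn_ratio_score(z, feats_correct, feats_wrong, k=10):
    """Is z closer to the "Correct" neighborhood or the "Wrong" one?

    z: feature embedding of a new input, shape (d,)
    feats_correct / feats_wrong: embeddings of held-out examples the
    model got right / got wrong, shape (n, d)
    We take the distance to the k-th nearest neighbor in each group;
    a score above 1 means z sits deeper in "Correct" territory.
    """
    d_c = np.sort(np.linalg.norm(feats_correct - z, axis=1))[k - 1]
    d_w = np.sort(np.linalg.norm(feats_wrong - z, axis=1))[k - 1]
    return d_w / (d_c + 1e-12)
```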
The "Hybrid" Strategy:
- Sometimes the map is helpful, but sometimes the "confidence score" (the thermometer) is also useful.
- The authors found that combining the map distance with the confidence score works even better. It's like having a GPS and a weather report: you get the best of both worlds (see the sketch below).
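One simple way to combine the two signals is a weighted sum, sketched below; the mixing weight `alpha` is a hypothetical knob tuned on validation data, and the paper's actual combination rule may differ.

```python
def hybrid_score(confidence, distance_ratio, alpha=0.5):
    """Blend the "thermometer" (confidence) with the "map" (distance).

    confidence: e.g. the max-softmax probability sketched earlier
    distance_ratio: e.g. the KNN ratio score sketched earlier
    alpha: hypothetical mixing weight in [0, 1], tuned on held-out data
    """
    return alpha * confidence + (1 - alpha) * distance_ratio
```

A weighted sum is just the simplest choice; a product or a rank-based combination would serve the same purpose of letting each signal cover the other's blind spots.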
Why This Matters (The "Covariate Shift" Problem)
The paper focuses on a specific, tricky scenario called Covariate Shift.
- The Metaphor: Imagine you trained your weather forecaster using photos of real clouds.
- Now, you ask them to predict the weather based on cartoon drawings of clouds or paintings of clouds.
- The meaning (it's a cloud) is the same, but the look is totally different.
- Old AI models get confused and make confident mistakes.
- The new methods in this paper are robust. They realize, "Hey, this looks like a cartoon, not a photo. I'm not sure if my 'Correct' map applies here," so they wisely abstain instead of guessing.
The Results
The authors tested this on:
- Vision: Identifying objects in photos, sketches, and corrupted images.
- Language: Understanding reviews and text.
The Verdict:
Their new "Detective" methods (especially the combination of distance and confidence) consistently outperformed the old methods. They made fewer mistakes and were far better at recognizing when to say, "I don't know," precisely in the tricky shifted situations.
Summary in One Sentence
This paper teaches AI models to stop guessing when they are confused by using a smart statistical rule that compares "how likely this is to be right" versus "how likely this is to be wrong," ensuring they only speak up when they are truly confident.