From Misclassifications to Outliers: Joint Reliability Assessment in Classification

This paper proposes a unified evaluation framework with new metrics (DS-F1 and DS-AURC) and an improved method (SURE+) to jointly assess and enhance classifier reliability by integrating out-of-distribution detection and in-distribution failure prediction, demonstrating that double scoring functions significantly outperform traditional single scoring approaches.

Yang Li, Youyang Sha, Yinzhi Wang, Timothy Hospedales, Xi Shen, Shell Xu Hu, Xuanlong Yu

Published 2026-03-05

Imagine you are hiring a security guard for a high-tech museum. This guard has two very important jobs:

  1. Spot the Fake: They need to know when someone is trying to sneak in a fake painting (Out-of-Distribution or OOD).
  2. Admit Mistakes: They need to know when they are unsure about a real painting and should ask for help, rather than confidently declaring, "That's a Van Gogh!" when it's actually a Monet (Failure Prediction).

For a long time, researchers treated these two jobs as completely separate problems. They hired one specialist to find fakes and another specialist to check the guard's confidence. But in the real world, you need one guard who can do both at the same time.

This paper argues that to build truly reliable AI, we need to evaluate these two skills together, not separately. Here is the breakdown of their solution in simple terms:

1. The Problem: The "Split Personality" Evaluation

Imagine you have two guards, Guard A and Guard B.

  • Guard A is amazing at spotting fakes but gets very confused when looking at real art, often making confident mistakes.
  • Guard B is great at looking at real art but is terrible at spotting fakes.

If you only look at their "Fake Spotting" score, Guard A wins. If you only look at their "Real Art" score, Guard B wins. But if you ask, "Who is the better overall security guard?" the answer is unclear, because the old scoring systems offered no fair way to compare them when they have to do both jobs at once.

The authors say: "Stop judging them separately. We need a score that tells us how well they handle the whole room of art, including the fakes and the tricky real pieces."

2. The Solution: The "Double-Check" System

The authors propose a new way to test the guard called Double Scoring. Instead of asking the guard for one answer ("Is this safe?"), they ask two questions with two different "checkpoints":

  • Checkpoint 1 (The OOD Score): "Does this look like something from our museum collection, or is it an intruder?"
  • Checkpoint 2 (The ID Score): "If it is from our collection, how confident are you that you identified it correctly?"

To accept a painting as "Safe," it must pass both checkpoints. If it fails the first, it's an intruder. If it passes the first but fails the second, the guard says, "I'm not sure, let's get a human expert."
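The two checkpoints above amount to a simple two-threshold decision rule. Here is a minimal sketch of that rule; the function name, score conventions (higher = safer), and threshold values are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch of the double-scoring decision rule described above.
# Names and thresholds are hypothetical, not taken from the paper's code.

def double_score_decision(ood_score, id_confidence,
                          ood_threshold=0.5, id_threshold=0.8):
    """Route an input using two scores.

    ood_score: checkpoint 1 -- higher means "looks in-distribution".
    id_confidence: checkpoint 2 -- the classifier's confidence in its
    own prediction.
    """
    if ood_score < ood_threshold:
        return "reject: out-of-distribution"   # fails checkpoint 1
    if id_confidence < id_threshold:
        return "defer: ask a human expert"     # passes 1, fails 2
    return "accept: trust the prediction"      # passes both checkpoints

print(double_score_decision(0.9, 0.95))  # accept: trust the prediction
print(double_score_decision(0.9, 0.40))  # defer: ask a human expert
print(double_score_decision(0.2, 0.99))  # reject: out-of-distribution
```

Note that an input only reaches the second checkpoint after passing the first, mirroring the "must pass both" rule in the text.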

3. The New Scorecards: DS-F1 and DS-AURC

To measure how good this Double-Check system is, they invented two new scorecards:

  • DS-F1 (The "Best Day" Score): This asks, "What is the absolute best performance this guard can achieve if we tune their settings perfectly?" It searches over both checkpoint thresholds to find the sweet spot where they catch the most fakes while making the fewest mistakes on real art.
  • DS-AURC (The "Consistency" Score): This asks, "How does the guard perform across every possible setting, not just the perfect one?" It averages reliability over the whole range of thresholds, so a guard only scores well if it stays dependable as the rules get stricter or looser.

The Analogy:
Think of DS-F1 as finding the perfect gear on a bicycle to go up a hill.
Think of DS-AURC as checking if the bike handles well on every gear, not just the perfect one.
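To make the "perfect gear" intuition behind DS-F1 concrete, here is a toy sketch that sweeps both thresholds and keeps the best F1-style score over "good accepts". The bookkeeping (what counts as a positive) is a simplification for illustration, not the paper's exact definition:

```python
# Toy sketch of the idea behind DS-F1: sweep both thresholds, keep the best
# F1 over correctly accepted samples. A simplification, not the paper's metric.
import numpy as np

def ds_f1_sketch(ood_scores, id_confidences, is_id, is_correct, grid=21):
    """ood_scores, id_confidences: per-sample scores (higher = safer).
    is_id: True if the sample is in-distribution.
    is_correct: True if the classifier's prediction is right."""
    best = 0.0
    thresholds = np.linspace(0.0, 1.0, grid)
    for t_ood in thresholds:
        for t_id in thresholds:
            accepted = (ood_scores >= t_ood) & (id_confidences >= t_id)
            # A "good accept" is an in-distribution sample classified correctly.
            good = is_id & is_correct
            tp = np.sum(accepted & good)
            fp = np.sum(accepted & ~good)
            fn = np.sum(~accepted & good)
            if tp == 0:
                continue
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

DS-AURC would instead aggregate performance over all the threshold pairs in that sweep rather than keeping only the best one, which is why it rewards consistency.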

4. The New Guard: SURE+

The authors didn't just invent a better test; they built a better guard called SURE+.

Previous guards (like the standard "SURE" model) were good at spotting mistakes on real art but struggled when fakes showed up. SURE+ is like a guard who has been trained with a special "mix-and-match" technique. They practice with:

  • Distorted images (to learn to ignore weird lighting).
  • Pixel noise (to learn to ignore static).
  • Confidence calibration (learning to say "I don't know" when they really don't know).

The Result: SURE+ is the first guard that is truly reliable in both scenarios. It catches the fakes and knows when to stop and ask for help on tricky real paintings.
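The three training tricks in the list above (distortions, pixel noise, calibration) can be sketched in a few lines. This is only a hedged illustration of the *kind* of techniques involved; SURE+'s actual recipe, losses, and hyperparameters are in the paper and its code release:

```python
# Hedged sketch of the kinds of training-time tricks listed above,
# using numpy only. Illustrative, not the actual SURE+ pipeline.
import numpy as np

rng = np.random.default_rng(0)

def distort(image):
    """Crude 'weird lighting': random brightness/contrast jitter."""
    gain = rng.uniform(0.8, 1.2)    # contrast-like scale
    bias = rng.uniform(-0.1, 0.1)   # brightness-like shift
    return np.clip(image * gain + bias, 0.0, 1.0)

def add_pixel_noise(image, sigma=0.05):
    """'Static': additive Gaussian pixel noise."""
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: one common way to nudge a model toward
    calibrated, less overconfident probabilities."""
    n = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / n

image = rng.uniform(size=(32, 32, 3))
augmented = add_pixel_noise(distort(image))   # robustness to lighting + static
labels = smooth_labels(np.eye(10)[3])         # soft target for class 3 of 10
```

Label smoothing is used here as a stand-in for confidence calibration; the paper's method may calibrate confidence differently.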

5. The Big Discovery: "Far" vs. "Near" Fakes

The paper found something interesting about the "intruders":

  • Far-OOD (The Obvious Fakes): If an intruder walks in wearing a clown suit in a formal museum, the guard spots them easily. The new system works great here.
  • Near-OOD (The Subtle Fakes): If an intruder wears a suit that looks almost exactly like the museum staff's uniform, it's much harder to spot. The new system helps a little here, but it's still a tough challenge.

Summary

This paper is a call to action for the AI community: Stop testing AI in isolation.

If you want an AI that is safe to use in the real world (like for self-driving cars or medical diagnosis), you can't just test if it's smart. You have to test if it knows when it's confused and when it's looking at something completely foreign.

They provided:

  1. A new rulebook (Double Scoring) to test AI fairly.
  2. New scorecards (DS-F1 and DS-AURC) to measure reliability.
  3. A new champion guard (SURE+) that actually passes the test.

By using this new framework, we can finally build AI systems that don't just guess confidently, but know when to say, "I'm not sure, please check this."