From Misclassifications to Outliers: Joint Reliability Assessment in Classification

This paper proposes a unified evaluation framework with new metrics (DS-F1 and DS-AURC) and an improved method (SURE+) to jointly assess and enhance classifier reliability by integrating out-of-distribution detection and in-distribution failure prediction, demonstrating that double scoring functions significantly outperform traditional single scoring approaches.

Yang Li, Youyang Sha, Yinzhi Wang, Timothy Hospedales, Xi Shen, Shell Xu Hu, Xuanlong Yu

Published 2026-03-05

Imagine you are hiring a security guard for a high-tech museum. This guard has two very important jobs:

  1. Spot the Fake: They need to know when someone is trying to sneak in a fake painting (Out-of-Distribution or OOD).
  2. Admit Mistakes: They need to know when they are unsure about a real painting and should ask for help, rather than confidently declaring, "That's a Van Gogh!" when it's actually a Monet (Failure Prediction).

For a long time, researchers treated these two jobs as completely separate problems. They hired one specialist to find fakes and another specialist to check the guard's confidence. But in the real world, you need one guard who can do both at the same time.

This paper argues that to build truly reliable AI, we need to evaluate these two skills together, not separately. Here is the breakdown of their solution in simple terms:

1. The Problem: The "Split Personality" Evaluation

Imagine you have two guards, Guard A and Guard B.

  • Guard A is amazing at spotting fakes but gets very confused when looking at real art, often making confident mistakes.
  • Guard B is great at looking at real art but is terrible at spotting fakes.

If you only look at their "Fake Spotting" score, Guard A wins. If you only look at their "Real Art" score, Guard B wins. But if you ask, "Who is the better overall security guard?" the answer is unclear, because the old scoring systems offered no fair way to compare them when they have to do both jobs at once.

The authors say: "Stop judging them separately. We need a score that tells us how well they handle the whole room of art, including the fakes and the tricky real pieces."

2. The Solution: The "Double-Check" System

The authors propose a new way to test the guard called Double Scoring. Instead of asking the guard for one answer ("Is this safe?"), they ask two questions with two different "checkpoints":

  • Checkpoint 1 (The OOD Score): "Does this look like something from our museum collection, or is it an intruder?"
  • Checkpoint 2 (The ID Score): "If it is from our collection, how confident are you that you identified it correctly?"

To accept a painting as "Safe," it must pass both checkpoints. If it fails the first, it's an intruder. If it passes the first but fails the second, the guard says, "I'm not sure, let's get a human expert."
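The two checkpoints above amount to a simple two-threshold decision rule. Here is a minimal sketch of that rule; the function name, score conventions (higher = safer), and threshold values are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch of the double-scoring decision rule described above.
# Names and thresholds are hypothetical, not taken from the paper's code.

def double_score_decision(ood_score, id_confidence,
                          ood_threshold=0.5, id_threshold=0.8):
    """Route an input using two scores.

    ood_score: checkpoint 1 -- higher means "looks in-distribution".
    id_confidence: checkpoint 2 -- the classifier's confidence in its
    own prediction.
    """
    if ood_score < ood_threshold:
        return "reject: out-of-distribution"   # fails checkpoint 1
    if id_confidence < id_threshold:
        return "defer: ask a human expert"     # passes 1, fails 2
    return "accept: trust the prediction"      # passes both checkpoints

print(double_score_decision(0.9, 0.95))  # accept: trust the prediction
print(double_score_decision(0.9, 0.40))  # defer: ask a human expert
print(double_score_decision(0.2, 0.99))  # reject: out-of-distribution
```

Note that an input only reaches the second checkpoint after passing the first, mirroring the "must pass both" rule in the text.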

3. The New Scorecards: DS-F1 and DS-AURC

To measure how good this Double-Check system is, they invented two new scorecards:

  • DS-F1 (The "Best Day" Score): This asks, "What is the absolute best performance this guard can achieve if we tune their settings perfectly?" It searches over both checkpoint thresholds to find the sweet spot where they catch the most fakes while making the fewest mistakes on real art.
  • DS-AURC (The "Consistency" Score): This asks, "How does the guard perform across every possible setting, not just the perfect one?" It averages reliability over the whole range of thresholds, so a guard only scores well if it stays dependable as the rules get stricter or looser.

The Analogy:
Think of DS-F1 as finding the perfect gear on a bicycle to go up a hill.
Think of DS-AURC as checking if the bike handles well on every gear, not just the perfect one.
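To make the "perfect gear" intuition behind DS-F1 concrete, here is a toy sketch that sweeps both thresholds and keeps the best F1-style score over "good accepts". The bookkeeping (what counts as a positive) is a simplification for illustration, not the paper's exact definition:

```python
# Toy sketch of the idea behind DS-F1: sweep both thresholds, keep the best
# F1 over correctly accepted samples. A simplification, not the paper's metric.
import numpy as np

def ds_f1_sketch(ood_scores, id_confidences, is_id, is_correct, grid=21):
    """ood_scores, id_confidences: per-sample scores (higher = safer).
    is_id: True if the sample is in-distribution.
    is_correct: True if the classifier's prediction is right."""
    best = 0.0
    thresholds = np.linspace(0.0, 1.0, grid)
    for t_ood in thresholds:
        for t_id in thresholds:
            accepted = (ood_scores >= t_ood) & (id_confidences >= t_id)
            # A "good accept" is an in-distribution sample classified correctly.
            good = is_id & is_correct
            tp = np.sum(accepted & good)
            fp = np.sum(accepted & ~good)
            fn = np.sum(~accepted & good)
            if tp == 0:
                continue
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

DS-AURC would instead aggregate performance over all the threshold pairs in that sweep rather than keeping only the best one, which is why it rewards consistency.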

4. The New Guard: SURE+

The authors didn't just invent a better test; they built a better guard called SURE+.

Previous guards (like the standard "SURE" model) were good at spotting mistakes on real art but struggled when fakes showed up. SURE+ is like a guard who has been trained with a special "mix-and-match" technique. They practice with:

  • Distorted images (to learn to ignore weird lighting).
  • Pixel noise (to learn to ignore static).
  • Confidence calibration (learning to say "I don't know" when they really don't know).

The Result: SURE+ is the first guard that is truly reliable in both scenarios. It catches the fakes and knows when to stop and ask for help on tricky real paintings.
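The three training tricks in the list above (distortions, pixel noise, calibration) can be sketched in a few lines. This is only a hedged illustration of the *kind* of techniques involved; SURE+'s actual recipe, losses, and hyperparameters are in the paper and its code release:

```python
# Hedged sketch of the kinds of training-time tricks listed above,
# using numpy only. Illustrative, not the actual SURE+ pipeline.
import numpy as np

rng = np.random.default_rng(0)

def distort(image):
    """Crude 'weird lighting': random brightness/contrast jitter."""
    gain = rng.uniform(0.8, 1.2)    # contrast-like scale
    bias = rng.uniform(-0.1, 0.1)   # brightness-like shift
    return np.clip(image * gain + bias, 0.0, 1.0)

def add_pixel_noise(image, sigma=0.05):
    """'Static': additive Gaussian pixel noise."""
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: one common way to nudge a model toward
    calibrated, less overconfident probabilities."""
    n = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / n

image = rng.uniform(size=(32, 32, 3))
augmented = add_pixel_noise(distort(image))   # robustness to lighting + static
labels = smooth_labels(np.eye(10)[3])         # soft target for class 3 of 10
```

Label smoothing is used here as a stand-in for confidence calibration; the paper's method may calibrate confidence differently.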

5. The Big Discovery: "Far" vs. "Near" Fakes

The paper found something interesting about the "intruders":

  • Far-OOD (The Obvious Fakes): If an intruder walks in wearing a clown suit in a formal museum, the guard spots them easily. The new system works great here.
  • Near-OOD (The Subtle Fakes): If an intruder wears a suit that looks almost exactly like the museum staff's uniform, it's much harder to spot. The new system helps a little here, but it's still a tough challenge.

Summary

This paper is a call to action for the AI community: Stop testing AI in isolation.

If you want an AI that is safe to use in the real world (like for self-driving cars or medical diagnosis), you can't just test if it's smart. You have to test if it knows when it's confused and when it's looking at something completely foreign.

They provided:

  1. A new rulebook (Double Scoring) to test AI fairly.
  2. New scorecards (DS-F1 and DS-AURC) to measure reliability.
  3. A new champion guard (SURE+) that actually passes the test.

By using this new framework, we can finally build AI systems that don't just guess confidently, but know when to say, "I'm not sure, please check this."