Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

This paper proposes decomposing mutual information into a per-class vector of epistemic uncertainty, addressing a limitation of scalar metrics in safety-critical classification. Through theoretical analysis and empirical validation on medical and image benchmarks, the authors show that this approach significantly improves selective prediction, out-of-distribution detection, and noise robustness by revealing class-specific ignorance that scalar measures obscure.

Mame Diarra Toure, David A. Stephens

Published 2026-02-27

Imagine you are a doctor using an AI to diagnose eye diseases from retinal scans. The AI is usually very good, but sometimes it gets confused.

In the world of standard AI, when the model is confused, it gives you a single number: "I am 30% unsure."

This is like a weatherman saying, "There is a 30% chance of rain." It tells you how much uncertainty exists, but it doesn't tell you what is causing the confusion. Is the model unsure if it's a sunny day or a cloudy day? Or is it unsure if it's a sunny day or a tornado?

In safety-critical fields like medicine, this distinction is life-or-death. Being unsure between "sunny" and "cloudy" is fine. Being unsure between "healthy" and "tornado" (or in our case, "healthy eye" vs. "blindness") is a crisis.

This paper introduces a new way to listen to the AI. Instead of asking "How unsure are you?", it asks, "Where exactly are you unsure?"

Here is the breakdown of their solution, using simple analogies:

1. The Problem: The "Black Box" of Confusion

Current AI models use a method called Bayesian Deep Learning. They run the same image through the network many times (like asking 100 different doctors for an opinion) and look at how much they disagree.

  • The Old Way (Mutual Information): They take all that disagreement and crush it into one single number. It's like a teacher grading a student's test and just saying, "You got a C." It doesn't tell you if the student failed because they didn't know math, or because they didn't know history.
  • The Flaw: If the AI is confused between two harmless diseases, that's okay. If it's confused between a harmless disease and a deadly one, that's a disaster. The old single number treats both situations exactly the same.
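
The "one single number" can be sketched as follows. This is a minimal illustration of the standard mutual-information score, not the paper's code, and the two example ensembles are invented to show the flaw: very different confusions collapse to the same scalar.

```python
import numpy as np

def scalar_mutual_information(probs):
    """Scalar epistemic uncertainty from an ensemble of softmax outputs.

    probs: shape (n_samples, n_classes), one prediction per stochastic
    forward pass (one "doctor's opinion" per row).
    MI = H(mean prediction) - mean(H(each prediction)).
    """
    eps = 1e-12  # avoid log(0)
    mean_p = probs.mean(axis=0)
    entropy_of_mean = -np.sum(mean_p * np.log(mean_p + eps))
    mean_entropy = -np.sum(probs * np.log(probs + eps), axis=1).mean()
    return entropy_of_mean - mean_entropy

# Confusion between two benign classes (0 vs 1)...
benign = np.array([[0.9, 0.1, 0.0],
                   [0.1, 0.9, 0.0]])
# ...and between a benign and a deadly class (0 vs 2):
deadly = np.array([[0.9, 0.0, 0.1],
                   [0.1, 0.0, 0.9]])
# Both collapse to the same scalar: the "you got a C" problem.
```

Running both ensembles through the function yields identical scores, even though only one situation is dangerous.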

2. The Solution: The "Per-Class" Breakdown

The authors created a new metric called C_k (think of it as a "Confusion Score" for each specific disease k).

Instead of one number, the AI now gives you a vector (a list of numbers), one for every possible disease.

  • Disease A (Benign): Confusion Score: 0.1 (Low)
  • Disease B (Benign): Confusion Score: 0.2 (Low)
  • Disease C (Deadly): Confusion Score: 0.9 (High!)

Now, the doctor knows exactly where the danger lies. The AI isn't just "unsure"; it is specifically terrified of confusing a healthy eye with a blind one.
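
One natural way to obtain such a vector is to split the same mutual information into one additive term per class. This is a sketch of the general idea; the paper's exact definition of C_k may differ.

```python
import numpy as np

def per_class_contributions(probs):
    """Split mutual information into one term per class.

    C_k = E[p_k log p_k] - mean(p_k) * log(mean(p_k)),
    chosen so that sum_k C_k equals the usual scalar MI.
    probs: shape (n_samples, n_classes).
    """
    eps = 1e-12  # avoid log(0)
    mean_p = probs.mean(axis=0)
    return (probs * np.log(probs + eps)).mean(axis=0) \
        - mean_p * np.log(mean_p + eps)

# Ensemble that flip-flops between class 0 and class 2:
deadly = np.array([[0.9, 0.0, 0.1],
                   [0.1, 0.0, 0.9]])
# The vector pinpoints the confusion: high for classes 0 and 2,
# exactly zero for the uninvolved class 1.
```

The scores sum back to the old scalar, so no information is lost; it is simply no longer crushed into one number.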

3. The Secret Sauce: The "Rare Class" Fix

There was a tricky math problem with previous attempts to do this.

  • The "Rare Class" Problem: Imagine a disease that almost never happens (like 1 in 10,000). Standard math says that if a disease is rare, the AI's "uncertainty" about it must be tiny. It's like saying, "Since this disease is so rare, the AI can't possibly be confused about it."
  • The Reality: The AI is confused about rare diseases, but standard math suppresses that signal. It's like a smoke detector that is turned down in the basement just because the basement is rarely used; but a basement fire is exactly the one you can least afford to miss.

The authors fixed this with a clever mathematical "weighting" (dividing by the average probability).

  • The Analogy: Imagine trying to hear a whisper in a noisy room. A loud shout needs no amplification, but a whisper (a rare disease) must be turned up before you can hear it.
  • Their formula acts like a volume knob that automatically turns up the signal for rare classes. This ensures that if the AI is confused about a rare, deadly disease, the alarm still goes off loud and clear.

4. The "Skewness" Check: Knowing When to Trust the AI

The authors also realized that sometimes the math they used to create these scores gets a little wobbly, especially when the AI is really confused.

  • The Analogy: Imagine you are estimating the height of a building. If the building is short, a simple ruler works. If the building is a skyscraper, a simple ruler might break or give a weird answer.
  • They added a "Skewness Diagnostic" (a little warning light). If the AI is confused in a weird, lopsided way, this light turns on, telling the doctor: "Hey, my usual calculation might be off; let's use a backup plan."
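
The "warning light" can be sketched as a simple skewness check on each class's sampled probabilities. The threshold of 1.0 here is an illustrative choice, not the paper's; in practice it would be calibrated.

```python
import numpy as np

def skewness_flags(probs, threshold=1.0):
    """Flag classes whose sampled probabilities are lopsided.

    probs: shape (n_samples, n_classes). Returns one boolean per
    class: True means the per-class score should be treated with
    caution (time for the "backup plan").
    """
    mean = probs.mean(axis=0)
    std = probs.std(axis=0)
    third = ((probs - mean) ** 3).mean(axis=0)
    skew = third / np.where(std > 0, std, 1.0) ** 3  # guard div-by-zero
    return np.abs(skew) > threshold

# Four ensemble passes over three classes; class 0's samples are
# lopsided (three low values, one high outlier), the others are not:
samples = np.array([[0.10, 0.45, 0.45],
                    [0.10, 0.35, 0.55],
                    [0.10, 0.55, 0.35],
                    [0.70, 0.15, 0.15]])
```

Only the lopsided class trips the warning light; the roughly symmetric ones stay trusted.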

5. Real-World Results: Saving Sight

They tested this on Diabetic Retinopathy (a leading cause of blindness).

  • The Goal: The AI should be allowed to make decisions on "safe" cases but should defer (pass the patient to a human doctor) when it is unsure about "dangerous" cases.
  • The Result: Using their new "Per-Class" method, the system reduced the risk of missing a dangerous case by 35% compared to the old methods.
  • The Metaphor: The old system was like a security guard who lets everyone through unless the whole building is on fire. The new system is a guard who knows exactly which door leads to the fire and blocks only that door, letting everyone else pass safely.
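
A class-aware deferral rule along these lines is easy to sketch. The score threshold and the "dangerous classes" list are illustrative assumptions; in practice the threshold would be tuned on validation data.

```python
import numpy as np

def should_defer(per_class_scores, dangerous_classes, threshold=0.5):
    """Defer to a human only when a high-stakes class is in doubt."""
    scores = np.asarray(per_class_scores)
    return bool((scores[list(dangerous_classes)] > threshold).any())

# Unsure only about two benign classes: let the AI proceed.
print(should_defer([0.9, 0.8, 0.1], dangerous_classes=[2]))  # False
# Unsure about the deadly class 2: block that door, call the doctor.
print(should_defer([0.1, 0.2, 0.9], dangerous_classes=[2]))  # True
```

A scalar score would sum all three numbers and defer in both cases; the per-class rule passes the harmless case through.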

Summary

This paper teaches us that knowing where you are ignorant is just as important as knowing how much you are ignorant.

By breaking down uncertainty into specific categories and giving extra attention to rare, dangerous ones, we can build AI systems that are not just smart, but also safe. It's the difference between a car that says "I might crash" and a car that says "I might crash into that specific pedestrian, so I'm hitting the brakes."
