Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

This paper proposes decomposing mutual information into a per-class vector of epistemic uncertainty, addressing a limitation of scalar metrics in safety-critical classification. Through theoretical analysis and empirical validation on medical and image benchmarks, the authors show that this approach significantly improves selective prediction, out-of-distribution detection, and noise robustness by revealing class-specific ignorance that scalar measures obscure.

Mame Diarra Toure, David A. Stephens

Published 2026-02-27

Imagine you are a doctor using an AI to diagnose eye diseases from retinal scans. The AI is usually very good, but sometimes it gets confused.

In the world of standard AI, when the model is confused, it gives you a single number: "I am 30% unsure."

This is like a weatherman saying, "There is a 30% chance of rain." It tells you how much uncertainty exists, but it doesn't tell you what is causing the confusion. Is the model unsure if it's a sunny day or a cloudy day? Or is it unsure if it's a sunny day or a tornado?

In safety-critical fields like medicine, this distinction is life-or-death. Being unsure between "sunny" and "cloudy" is fine. Being unsure between "healthy" and "tornado" (or in our case, "healthy eye" vs. "blindness") is a crisis.

This paper introduces a new way to listen to the AI. Instead of asking "How unsure are you?", it asks, "Where exactly are you unsure?"

Here is the breakdown of their solution, using simple analogies:

1. The Problem: The "Black Box" of Confusion

Current AI models use a method called Bayesian Deep Learning. They run the same image through the network many times (like asking 100 different doctors for an opinion) and look at how much they disagree.

  • The Old Way (Mutual Information): They take all that disagreement and crush it into one single number. It's like a teacher grading a student's test and just saying, "You got a C." It doesn't tell you if the student failed because they didn't know math, or because they didn't know history.
  • The Flaw: If the AI is confused between two harmless diseases, that's okay. If it's confused between a harmless disease and a deadly one, that's a disaster. The old single number treats both situations exactly the same.
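
The "one single number" can be sketched as follows. This is a minimal illustration of the standard mutual-information score, not the paper's code, and the two example ensembles are invented to show the flaw: very different confusions collapse to the same scalar.

```python
import numpy as np

def scalar_mutual_information(probs):
    """Scalar epistemic uncertainty from an ensemble of softmax outputs.

    probs: shape (n_samples, n_classes), one prediction per stochastic
    forward pass (one "doctor's opinion" per row).
    MI = H(mean prediction) - mean(H(each prediction)).
    """
    eps = 1e-12  # avoid log(0)
    mean_p = probs.mean(axis=0)
    entropy_of_mean = -np.sum(mean_p * np.log(mean_p + eps))
    mean_entropy = -np.sum(probs * np.log(probs + eps), axis=1).mean()
    return entropy_of_mean - mean_entropy

# Confusion between two benign classes (0 vs 1)...
benign = np.array([[0.9, 0.1, 0.0],
                   [0.1, 0.9, 0.0]])
# ...and between a benign and a deadly class (0 vs 2):
deadly = np.array([[0.9, 0.0, 0.1],
                   [0.1, 0.0, 0.9]])
# Both collapse to the same scalar: the "you got a C" problem.
```

Running both ensembles through the function yields identical scores, even though only one situation is dangerous.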

2. The Solution: The "Per-Class" Breakdown

The authors created a new metric called C_k (think of it as a "Confusion Score" for each specific disease k).

Instead of one number, the AI now gives you a vector (a list of numbers), one for every possible disease.

  • Disease A (Benign): Confusion Score: 0.1 (Low)
  • Disease B (Benign): Confusion Score: 0.2 (Low)
  • Disease C (Deadly): Confusion Score: 0.9 (High!)

Now, the doctor knows exactly where the danger lies. The AI isn't just "unsure"; it is specifically terrified of confusing a healthy eye with a blind one.
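
One natural way to obtain such a vector is to split the same mutual information into one additive term per class. This is a sketch of the general idea; the paper's exact definition of C_k may differ.

```python
import numpy as np

def per_class_contributions(probs):
    """Split mutual information into one term per class.

    C_k = E[p_k log p_k] - mean(p_k) * log(mean(p_k)),
    chosen so that sum_k C_k equals the usual scalar MI.
    probs: shape (n_samples, n_classes).
    """
    eps = 1e-12  # avoid log(0)
    mean_p = probs.mean(axis=0)
    return (probs * np.log(probs + eps)).mean(axis=0) \
        - mean_p * np.log(mean_p + eps)

# Ensemble that flip-flops between class 0 and class 2:
deadly = np.array([[0.9, 0.0, 0.1],
                   [0.1, 0.0, 0.9]])
# The vector pinpoints the confusion: high for classes 0 and 2,
# exactly zero for the uninvolved class 1.
```

The scores sum back to the old scalar, so no information is lost; it is simply no longer crushed into one number.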

3. The Secret Sauce: The "Rare Class" Fix

There was a tricky math problem with previous attempts to do this.

  • The "Rare Class" Problem: Imagine a disease that almost never happens (like 1 in 10,000). Standard math says that if a disease is rare, the AI's "uncertainty" about it must be tiny. It's like saying, "Since this disease is so rare, the AI can't possibly be confused about it."
  • The Reality: The AI is confused about rare diseases, but standard math suppresses that signal. It's like a smoke detector that is turned down in the basement just because the basement is rarely used; but a basement fire is exactly the one you can least afford to miss.

The authors fixed this with a clever mathematical "weighting" (dividing by the average probability).

  • The Analogy: Imagine trying to hear a whisper in a noisy room. A loud shout needs no amplification, but a whisper (a rare disease) must be turned up before you can hear it.
  • Their formula acts like a volume knob that automatically turns up the signal for rare classes. This ensures that if the AI is confused about a rare, deadly disease, the alarm still goes off loud and clear.

4. The "Skewness" Check: Knowing When to Trust the AI

The authors also realized that sometimes the math they used to create these scores gets a little wobbly, especially when the AI is really confused.

  • The Analogy: Imagine you are estimating the height of a building. If the building is short, a simple ruler works. If the building is a skyscraper, a simple ruler might break or give a weird answer.
  • They added a "Skewness Diagnostic" (a little warning light). If the AI is confused in a weird, lopsided way, this light turns on, telling the doctor: "Hey, my usual calculation might be off; let's use a backup plan."
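
The "warning light" can be sketched as a simple skewness check on each class's sampled probabilities. The threshold of 1.0 here is an illustrative choice, not the paper's; in practice it would be calibrated.

```python
import numpy as np

def skewness_flags(probs, threshold=1.0):
    """Flag classes whose sampled probabilities are lopsided.

    probs: shape (n_samples, n_classes). Returns one boolean per
    class: True means the per-class score should be treated with
    caution (time for the "backup plan").
    """
    mean = probs.mean(axis=0)
    std = probs.std(axis=0)
    third = ((probs - mean) ** 3).mean(axis=0)
    skew = third / np.where(std > 0, std, 1.0) ** 3  # guard div-by-zero
    return np.abs(skew) > threshold

# Four ensemble passes over three classes; class 0's samples are
# lopsided (three low values, one high outlier), the others are not:
samples = np.array([[0.10, 0.45, 0.45],
                    [0.10, 0.35, 0.55],
                    [0.10, 0.55, 0.35],
                    [0.70, 0.15, 0.15]])
```

Only the lopsided class trips the warning light; the roughly symmetric ones stay trusted.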

5. Real-World Results: Saving Sight

They tested this on Diabetic Retinopathy (a leading cause of blindness).

  • The Goal: The AI should be allowed to make decisions on "safe" cases but should defer (pass the patient to a human doctor) when it is unsure about "dangerous" cases.
  • The Result: Using their new "Per-Class" method, the system reduced the risk of missing a dangerous case by 35% compared to the old methods.
  • The Metaphor: The old system was like a security guard who lets everyone through unless the whole building is on fire. The new system is a guard who knows exactly which door leads to the fire and blocks only that door, letting everyone else pass safely.
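
A class-aware deferral rule along these lines is easy to sketch. The score threshold and the "dangerous classes" list are illustrative assumptions; in practice the threshold would be tuned on validation data.

```python
import numpy as np

def should_defer(per_class_scores, dangerous_classes, threshold=0.5):
    """Defer to a human only when a high-stakes class is in doubt."""
    scores = np.asarray(per_class_scores)
    return bool((scores[list(dangerous_classes)] > threshold).any())

# Unsure only about two benign classes: let the AI proceed.
print(should_defer([0.9, 0.8, 0.1], dangerous_classes=[2]))  # False
# Unsure about the deadly class 2: block that door, call the doctor.
print(should_defer([0.1, 0.2, 0.9], dangerous_classes=[2]))  # True
```

A scalar score would sum all three numbers and defer in both cases; the per-class rule passes the harmless case through.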

Summary

This paper teaches us that knowing where you are ignorant is just as important as knowing how much you are ignorant.

By breaking down uncertainty into specific categories and giving extra attention to rare, dangerous ones, we can build AI systems that are not just smart, but also safe. It's the difference between a car that says "I might crash" and a car that says "I might crash into that specific pedestrian, so I'm hitting the brakes."
