Learning Credal Ensembles via Distributionally Robust Optimization

This paper introduces CreDRO, a distributionally robust optimization framework for learning credal ensembles. It defines epistemic uncertainty as disagreement among models trained under varying relaxations of the i.i.d. assumption, thereby capturing meaningful distribution shifts and outperforming existing methods in out-of-distribution detection and selective classification.

Kaizheng Wang, Ghifari Adam Faza, Fabio Cuzzolin, Siu Lun Chau, David Moens, Hans Hallez

Published 2026-02-27

The Big Picture: Why Do We Need This?

Imagine you are a doctor looking at an X-ray. You are 99% sure the patient has a broken bone. But then, you realize the X-ray machine is old and blurry, and you've never seen this specific type of fracture before.

In the world of Artificial Intelligence (AI), this is called Uncertainty. There are two types:

  1. Noise (Aleatoric Uncertainty): The X-ray is just blurry. Even a perfect doctor couldn't be 100% sure. This is unavoidable.
  2. Ignorance (Epistemic Uncertainty): The doctor doesn't know because they haven't seen this specific case before. This is the dangerous kind. If the AI is confident but wrong, it could make a life-threatening mistake.

Current AI methods are good at measuring the "blurry" noise, but they are terrible at measuring "ignorance." They often act confident simply because they trained on a lot of data, even when that data doesn't match the real world.

The Problem with Current Methods: The "Dice Roll" Approach

Most state-of-the-art AI models try to measure ignorance by training the same model multiple times with slightly different random starting points (like rolling dice to decide where to start).

The Analogy: Imagine you are trying to guess the weather in a new city.

  • Current Method: You ask 10 friends to guess the weather. But, you tell them to close their eyes and spin around before looking out the window.
  • The Result: Your friends give different answers. You say, "Wow, there is high uncertainty!"
  • The Flaw: The uncertainty isn't because the weather is weird; it's because your friends were dizzy from spinning! The AI is measuring its own confusion about how to start training, not its confusion about the real world.

The Solution: CreDRO (The "Stress-Test" Approach)

The authors propose a new method called CreDRO. Instead of spinning their friends around, they put them in different, slightly stressful environments to see how they react.

The Analogy:
Imagine you are training a team of pilots.

  • Old Way: You have them fly the same route 10 times, but you change the wind direction randomly each time just to see how they handle it.
  • CreDRO Way: You tell Pilot A, "Fly assuming the wind is calm." You tell Pilot B, "Fly assuming the wind is a light breeze." You tell Pilot C, "Fly assuming a hurricane is coming."
  • The Result: If all pilots agree the plane is safe, you are confident. But if Pilot A says "Safe" and Pilot C says "Crash imminent," you know there is a real problem. You don't know which scenario is true, so you admit you are uncertain.

How CreDRO Works (The Technical Magic)

The paper uses a technique called Distributionally Robust Optimization (DRO).

  1. The "What If" Game: During training, the AI doesn't just look at the data as it is. It asks, "What if the data I'm seeing is slightly different from the data I'll see in the real world?"
  2. The Stress Test: It creates a "worst-case scenario" for the training data. It forces the AI to learn to handle data that is slightly "off" or "shifted."
  3. The Ensemble: It trains a group of models (an ensemble). Each model is trained to handle a different level of "off-ness."
    • Model 1: Handles data that is 5% different.
    • Model 2: Handles data that is 10% different.
    • Model 3: Handles data that is 20% different.
  4. The "Box" of Answers: When the AI makes a prediction, instead of giving one single number (e.g., "80% chance of rain"), it gives a range (e.g., "Between 40% and 90% chance").
    • If the range is small (40% to 42%), the AI is confident.
    • If the range is huge (10% to 90%), the AI is admitting, "I don't know what's going on here."
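The four steps above can be sketched in code. This is a toy illustration, not the paper's actual algorithm: it uses a simple CVaR-style reweighting (averaging the loss over only the hardest fraction of examples) as a stand-in for the DRO "worst-case scenario" objective, with a smaller `alpha` playing the role of a larger "off-ness" level. All function names and hyperparameters here are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D binary classification data with label noise
X = rng.normal(size=200)
y = (X + 0.3 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_stress_model(X, y, alpha, steps=500, lr=0.5):
    """Logistic regression trained on the average loss of the hardest
    alpha-fraction of examples (a CVaR-style stand-in for DRO).

    alpha=1.0 is ordinary training on all data; smaller alpha means a
    harsher "stress test" -- the model must do well even on a
    worst-case reweighting of the training set."""
    w, b = 0.0, 0.0
    n = len(X)
    k = max(1, int(np.ceil(alpha * n)))
    for _ in range(steps):
        p = sigmoid(w * X + b)
        losses = -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
        worst = np.argsort(losses)[-k:]      # hardest k examples
        grad = p[worst] - y[worst]           # d(loss)/d(logit)
        w -= lr * np.mean(grad * X[worst])
        b -= lr * np.mean(grad)
    return w, b

# Step 3: one model per stress level (the ensemble)
models = [train_stress_model(X, y, alpha) for alpha in (1.0, 0.5, 0.2)]

def credal_predict(x):
    """Step 4: return the 'box' of answers -- the interval spanned by
    the ensemble's predicted probabilities."""
    probs = [sigmoid(w * x + b) for w, b in models]
    return min(probs), max(probs)

lower, upper = credal_predict(0.1)
print(f"P(class 1) is between {lower:.2f} and {upper:.2f}")
```

The width `upper - lower` is the epistemic uncertainty: a narrow interval means the differently stressed models agree, a wide one means they don't.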

Why Is This Better?

The paper tested CreDRO against the best existing methods in two main areas:

  1. Spotting the "Weird" Stuff (Out-of-Distribution Detection):

    • Scenario: You train a model on pictures of cats and dogs. Then you show it a picture of a toaster.
    • Old AI: Might confidently say, "That's a very strange cat!"
    • CreDRO: Says, "I have no idea what that is. My confidence range is huge. Please don't trust me."
    • Result: CreDRO was much better at spotting that the input was weird and didn't belong in its training data.
  2. Medical Safety (Selective Classification):

    • Scenario: A doctor uses AI to diagnose cancer.
    • CreDRO: If the AI is unsure, it can say, "I reject this case. Please have a human look at it."
    • Result: By letting the AI "reject" the hard cases, the overall accuracy of the system went up because the AI only made predictions when it was actually sure.
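The selective-classification idea above is simple enough to show directly: predict only when the credal interval is narrow, and defer to a human otherwise. The threshold value here is an arbitrary illustration, not a number from the paper.

```python
def decide(lower, upper, width_threshold=0.3):
    """Abstain when the credal interval [lower, upper] is too wide
    (high epistemic uncertainty); otherwise predict from its midpoint."""
    if upper - lower > width_threshold:
        return "reject"  # hand the case to a human expert
    return "positive" if (lower + upper) / 2 >= 0.5 else "negative"

print(decide(0.82, 0.88))  # narrow and high  -> "positive"
print(decide(0.40, 0.42))  # narrow but low   -> "negative"
print(decide(0.10, 0.90))  # wide             -> "reject"
```

The same width check doubles as an out-of-distribution flag: a toaster shown to a cats-vs-dogs ensemble should produce a wide interval and land in the "reject" bucket rather than being called a strange cat.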

The Takeaway

CreDRO is like a safety net for AI. Instead of asking, "How confused are you because you started training differently?" it asks, "How confused are you because the real world might be different from your training data?"

By training the AI to expect the unexpected, it learns to admit when it doesn't know the answer. This makes AI safer, more reliable, and much more trustworthy for critical jobs like medicine, self-driving cars, and finance.

In short: CreDRO stops the AI from bluffing. If it's unsure, it raises its hand and says, "I need help," rather than guessing and hoping for the best.
