Multi-criterion uncertainty estimation improves skin cancer distribution shift detection and malignancy prediction

This paper introduces Supervised Autoencoders for Generalization Estimates (SAGE), a multi-criterion uncertainty estimation method that detects distribution shifts in skin lesion images across diverse global datasets. By filtering out problematic images before clinical deployment, SAGE improves the reliability of malignancy prediction models.

Schreyer, W. M., Samathan, R., Berry, E., Thompson, R. F.

Published 2026-02-27

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master chef who has spent years perfecting a recipe for a specific type of soup using only the freshest, most perfectly uniform vegetables from a single, high-end farm (let's call this the HAM10000 Farm). You are so good at this that you can predict exactly how the soup will taste with 99% accuracy.

Now, imagine you want to open a soup kitchen for the whole world. But instead of your perfect farm, you start getting vegetables from:

  • A dusty roadside stand in Argentina.
  • A rainy garden in Brazil.
  • A chaotic street market in the US.

The vegetables are real, but they are different sizes, covered in dirt, sometimes wrapped in plastic, and some are even different types of vegetables entirely (like a pumpkin instead of a carrot). If you try to cook your "perfect farm" soup with these new ingredients, the result might be a disaster. The soup could taste weird, or worse, you might accidentally serve something toxic because you didn't realize the ingredients were different.

This is exactly the problem doctors and AI face with skin cancer detection.

The Problem: The "Perfect Farm" Trap

For years, AI models have been trained on "perfect" datasets of skin images (like the HAM10000 dataset). These images are clear, well-lit, and taken with special medical cameras (dermoscopes). The AI learns to spot cancer on these perfect images very well.

But in the real world, photos are messy. They are taken with regular smartphones, in bad lighting, with hair covering the spot, or with rulers and markers in the picture. When the AI sees these "messy" photos, it often gets confused. It might think a harmless mole is cancer, or miss a real cancer because the photo looked "weird" to the computer. This is called Distribution Shift: the data the AI sees in the real world doesn't match the data it studied in school.

The Solution: The "SAGE" Quality Inspector

The authors of this paper created a new tool called SAGE (Supervised Autoencoders for Generalization Estimates). Think of SAGE not as a doctor who diagnoses the disease, but as a strict quality control inspector standing at the door of the hospital.

Here is how SAGE works, using a simple analogy:

  1. The Three-Point Check: When a new photo arrives, SAGE doesn't just look at the picture; it runs three quick tests to see if the photo "feels" like the photos the AI studied in school:

    • The "Shape" Test (Reconstruction): SAGE tries to redraw the image from memory. If it can't redraw it well, the image is weird (maybe it's blurry or has a ruler in it).
    • The "Neighbor" Test (Distance): SAGE checks if this image is hanging out with its "friends" (the training data). If the image is standing alone in a corner of the room, it's an outsider.
    • The "Confidence" Test: SAGE asks the AI, "Are you sure about this?" If the AI is shaking in its boots, SAGE flags it.
  2. The "SAGE Score": SAGE combines these three tests into a single score.

    • Low Score: "This photo looks just like the ones we studied. It's safe to let the AI diagnose it."
    • High Score: "Whoa! This photo has a ruler in it, the lighting is weird, or it's a type of lesion we've never seen. Stop! Do not let the AI make a diagnosis on this one."
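The three checks above get folded into one number. This summary doesn't spell out the paper's exact combination rule, so the sketch below assumes a simple weighted sum of the three signals (reconstruction error, distance to training neighbors, and inverted classifier confidence). The function name, weights, and numbers are illustrative, not the authors' implementation.

```python
import numpy as np

def sage_score(recon_error, neighbor_distance, confidence, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combination of three uncertainty signals.

    recon_error:       how badly the autoencoder redraws the image (higher = weirder)
    neighbor_distance: how far the image sits from its training "friends"
    confidence:        the classifier's top softmax probability (higher = more sure)

    Confidence is inverted so that, for every signal, bigger means
    "less like the training data". A higher total score means: flag the image.
    """
    signals = np.array([recon_error, neighbor_distance, 1.0 - confidence])
    return float(np.dot(np.array(weights), signals))

# A clean dermoscopic image: redraws well, sits near training data, model is sure.
familiar = sage_score(recon_error=0.05, neighbor_distance=0.1, confidence=0.95)

# A smartphone photo with a ruler: poor reconstruction, isolated, model unsure.
strange = sage_score(recon_error=0.7, neighbor_distance=0.9, confidence=0.4)
```

The weights would let you tune how much each test matters; equal weights are just the simplest starting point.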

What They Found

The researchers tested this system on photos from five different countries (Argentina, Brazil, Austria, Turkey, and the US). Here is what they discovered:

  • The "Messy" Photos: Photos taken with regular smartphones often had high SAGE scores because of things like camera flashes, hair, or rulers. The AI was much less reliable on these.
  • The "Dark Skin" Gap: The AI struggled more with darker skin tones, partly because the training data didn't have enough dark skin, and partly because the lighting on dark skin often creates "weird" artifacts that confuse the AI. SAGE successfully flagged these difficult images.
  • The "New Disease" Problem: The AI was terrible at spotting rare skin cancers it had never seen before (like T-cell lymphoma). Interestingly, the AI was too confident about these new diseases, thinking they were common ones. SAGE, however, correctly flagged them as "Out of Distribution" (strangers) because they didn't look like the training data.

The Result: A Safer Kitchen

By using SAGE to filter out the "bad" photos before the AI makes a diagnosis, the researchers showed that the AI became more accurate.

  • Before SAGE: The AI tried to diagnose everything, including the messy photos, and made mistakes.
  • After SAGE: SAGE threw out the confusing photos. The AI only looked at the "good" photos, and its accuracy went up.
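This before/after comparison is an instance of selective prediction: pick a score threshold, let the classifier answer only on images below it, and measure accuracy on what remains. The sketch below uses made-up toy numbers; the threshold and function name are assumptions for illustration, not values from the paper.

```python
import numpy as np

def selective_accuracy(sage_scores, was_correct, threshold):
    """Accuracy on the images the AI is allowed to diagnose
    (score below threshold), plus coverage (fraction of images kept)."""
    keep = sage_scores < threshold
    if not keep.any():
        return float("nan"), 0.0
    return float(was_correct[keep].mean()), float(keep.mean())

# Toy data: the two high-score ("weird") photos are the ones the AI got wrong.
scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
correct = np.array([1, 1, 1, 0, 0])

overall = correct.mean()  # diagnose everything: 3 of 5 right (60%)
filtered, coverage = selective_accuracy(scores, correct, threshold=0.5)
```

The trade-off is coverage: the stricter the threshold, the fewer images the AI handles, and the rejected ones go back to a human expert rather than being guessed at.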

Why This Matters

This paper is like giving the AI a pair of glasses that helps it realize when it is out of its depth.

Instead of blindly trusting an AI to diagnose skin cancer from a random smartphone photo, SAGE acts as a safety net. It tells doctors: "Hey, this photo is too weird for the AI to handle safely. Please, a human doctor should look at this one."

This is crucial for health equity. It ensures that the AI doesn't accidentally fail patients with darker skin or those in rural areas who use smartphones, by catching the moments when the AI is likely to be wrong and handing the job back to a human expert.

In short: SAGE is the bouncer at the club who checks the ID. If the photo doesn't match the "training club" rules, SAGE says, "No entry for the AI," preventing a medical disaster.
