Application of deep learning and explainable AI-supported medical decision-making for facial phenotyping in genetic syndromes

This study found that both AI predictions and explainable AI (XAI) saliency maps improved diagnostic accuracy when the AI was correct. However, medical geneticists relied far more on the raw prediction probabilities than on the XAI explanations, which they rated less favorably and which did not significantly improve how they incorporated the AI's output into their decisions.

Sumer, O., Huber, T., Cheng, J., Duong, D., Ledgister Hanchard, S. E., Conati, C., Andre, E., Solomon, B. D., Waikel, R. L.

Published 2026-03-12

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master detective trying to solve a very tricky case: identifying a rare genetic condition just by looking at a person's face. These conditions are like rare fingerprints; they have specific patterns (like the shape of the nose, eyes, or mouth) that only a trained expert usually notices.

Now, imagine you have a super-smart robot assistant (Artificial Intelligence) that can also look at these faces and guess the diagnosis. But here's the catch: sometimes the robot is right, and sometimes it's confidently wrong.

This paper is about a big experiment to see if giving the human detectives two different types of help from the robot makes them better at solving the case.

The Two Types of Help

The researchers tested two groups of medical experts (geneticists):

  1. The "Scoreboard" Group (AI-Only): These detectives were shown the face and the robot's guess, along with a confidence score (e.g., "I am 90% sure this is Syndrome A"). It's like the robot whispering, "I think it's this, and I'm pretty sure."
  2. The "Highlighter" Group (XAI-Supported): These detectives got the same score, plus a visual "highlighter" (called a Saliency Map). This map glows on the parts of the face the robot thinks are important. It's like the robot pointing a laser pointer at the nose and saying, "Look here! The nose shape is what made me think it's Syndrome A." They also got a simple chart summarizing which features mattered most.

The Experiment: What Happened?

The researchers showed 18 different faces to 44 experts. Half the time, the robot was right. Half the time, the robot was wrong.

1. When the Robot was Right:
Both groups got a little boost. Seeing the robot's guess made the experts more confident and slightly more likely to agree with the correct answer. It was like having a co-pilot confirm your navigation; you feel safer and stick to the right path.

2. When the Robot was Wrong:
This is where things got interesting.

  • The Scoreboard Group: When the robot said, "I'm 90% sure it's Syndrome A" (but it was actually Syndrome B), the experts often got confused and changed their correct answer to the wrong one. They trusted the robot's confidence too much.
  • The Highlighter Group: The "Highlighter" group was also misled by the wrong robot, and the glowing map didn't help them catch the mistake. If anything, the map sometimes made them more skeptical of the robot, but since they couldn't examine the patient or gather more information during the test, they had nothing else to go on.

The Big Surprise: The "Highlighter" Didn't Work

The researchers expected that seeing where the robot was looking (the highlighter) would help the experts understand the robot better and make smarter decisions. They thought it would be like a teacher showing you the steps to solve a math problem.

But it didn't work that way.

  • The experts generally found the "highlighter" maps confusing or unhelpful.
  • They didn't trust the glowing spots. Sometimes the robot highlighted the wrong part of the face, or the experts didn't know what the highlighted spot meant.
  • The experts relied much more on the confidence score (the "Scoreboard") than on the visual explanation. If the robot said "90% sure," they listened. If the robot showed a map, they mostly ignored it or found it distracting.

The Takeaway: Why This Matters

Think of it like a GPS in your car.

  • AI-Only is the GPS saying, "Turn left in 500 feet."
  • XAI (The Highlighter) is the GPS trying to explain why it wants you to turn left by showing a complex map of traffic patterns and road construction.

The study found that when the GPS is right, the simple instruction is enough. But when the GPS is wrong, showing you the complex map doesn't help you realize it's wrong; in fact, it might just make you doubt yourself or get frustrated.

The Main Lesson:
Giving doctors "explanations" (like heat maps) isn't automatically helpful. In fact, if the explanation is hard to understand or doesn't match what the doctor expects, it can actually get in the way. The doctors trusted the robot's "gut feeling" (the probability score) more than its "reasoning" (the visual map).

What's Next?
The researchers conclude that we need to build better ways to explain AI. Instead of just showing a glowing map, we might need to explain it in plain language (e.g., "The robot thinks this is Syndrome A because the eyes are wide-set, which is a key feature"). Until then, doctors should be careful not to blindly trust the robot, even when it gives a confidence score.
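As a rough illustration of the "plain language" idea, the sketch below turns a short list of feature importances into a one-sentence explanation instead of a heat map. The feature names, scores, and wording are invented for illustration; the preprint does not specify how such text explanations would be generated.

```python
# Hypothetical sketch: convert the model's most important facial features
# into a plain-language sentence. The feature names and importance scores
# below are made up for illustration, not taken from the study.
def explain_in_words(prediction: str, importances: dict[str, float], top_k: int = 2) -> str:
    # Pick the features the model weighted most heavily.
    top = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    reasons = " and ".join(name for name, _ in top)
    return f"The model suggests {prediction} mainly because of the {reasons}."

print(explain_in_words(
    "Syndrome A",
    {"wide-set eyes": 0.46, "nose shape": 0.31, "mouth width": 0.12},
))
# -> The model suggests Syndrome A mainly because of the wide-set eyes and nose shape.
```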
