Enhancing SHAP Explainability for Diagnostic and Prognostic ML Models in Alzheimer Disease

This paper proposes and validates a multi-level explainability framework demonstrating that SHAP explanations for Alzheimer's disease diagnostic and prognostic models are robust, stable, and consistent across different disease stages and prediction tasks, thereby enhancing their reliability for clinical adoption.

Pablo Guillén, Enrique Frias-Martinez

Published 2026-03-10
📖 5 min read · 🧠 Deep dive

Imagine you have a very smart, super-fast robot doctor that can look at a patient's medical records and tell you two things:

  1. Diagnosis: "Does this person have Alzheimer's right now?"
  2. Prognosis: "Will this person get worse in the next four years?"

This robot is incredibly accurate, but it has a problem: it's a "black box." It gives you the answer, but it won't tell you why it thinks that. It's like a friend who says, "I know you're going to win the lottery," but refuses to explain how they know. Doctors can't trust a tool they don't understand, especially when it comes to life-and-death decisions.

This paper is about teaching that robot doctor to speak human and proving that its reasoning is reliable, not just lucky.

Here is the breakdown of their work using simple analogies:

1. The Problem: The "Magic 8-Ball" Effect

The researchers used a tool called SHAP (short for SHapley Additive exPlanations — think of it as a magnifying glass that shows how much each clue contributed to a single decision). Usually, the robot points to things like "Memory Test Scores" or "Ability to Pay Bills" as the main reasons for its decision.

But there was a catch:

  • The "One-Off" Problem: If you asked the robot to diagnose a patient, it might say, "It's because of Memory." But if you asked it to predict the future, would it still say, "It's because of Memory"? Or would it suddenly switch to "It's because of Genetics"?
  • The "Fickle Friend" Problem: If the robot changes its mind about why it's making a prediction every time you tweak the data slightly, doctors can't trust it. They need to know the robot is consistent.
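
Under the hood, SHAP's "magnifying glass" assigns each feature a fair share of the prediction. The toy sketch below computes exact Shapley values by brute force for a made-up three-feature "risk score" — the feature names, weights, and patient values are all invented for illustration, and real pipelines (including the shap library) use far more efficient approximations:

```python
from itertools import combinations
from math import factorial

def model(x):
    # Toy linear "risk score": memory matters most, genetics least.
    memory, bills, gene = x
    return 0.6 * memory + 0.3 * bills + 0.1 * gene

def shapley_values(f, x, baseline):
    """Exact Shapley attribution: average a feature's marginal
    contribution over every possible coalition of the other features."""
    n = len(x)

    def v(coalition):
        # Features in the coalition keep the patient's values;
        # the rest fall back to the cohort baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (v(set(subset) | {i}) - v(subset))
        phis.append(phi)
    return phis

patient = [1.0, 1.0, 1.0]   # abnormal on all three features
baseline = [0.0, 0.0, 0.0]  # cohort average
attributions = shapley_values(model, patient, baseline)
print(attributions)  # largest share goes to the memory feature
```

A useful sanity check: the attributions always sum to the gap between the patient's prediction and the baseline prediction, which is exactly why they read as "this clue contributed this much."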

2. The Solution: The "Three-Point Stability Test"

The authors created a new framework to test if the robot's explanations are sturdy. They used three creative tests:

  • Test A: The "Internal Logic" Check (Coherence)

    • Analogy: Imagine a detective who solves a crime. The detective's notebook (Feature Importance) says "The Butler did it." But when the detective explains it to the jury (SHAP), they say, "It was the Maid."
    • The Test: The researchers checked if the robot's internal "notebook" matched its "explanation." They found that for the most part, the robot was honest: what it used to learn was the same thing it used to explain.
  • Test B: The "Same Story, Different Chapter" Check (Stability)

    • Analogy: Imagine reading a mystery novel. In Chapter 1 (Early Stage), the clues point to the Butler. In Chapter 10 (Late Stage), do the clues still point to the Butler, or does the story suddenly change to a completely different suspect?
    • The Test: They checked if the robot used the same "clues" (like memory or attention) whether the patient was in the early stages of confusion or the late stages of dementia.
    • Result: The robot was very consistent! It kept pointing to the same cognitive clues (Memory, Judgment, Attention) regardless of how far the disease had progressed.
  • Test C: The "Past vs. Future" Check (Transferability)

    • Analogy: A weather forecaster who says, "It's raining now because of dark clouds." If you ask, "Will it rain tomorrow?" a good forecaster should still say, "Yes, because of those same dark clouds," not suddenly say, "No, because of the wind."
    • The Test: They compared the robot's reasons for diagnosing the disease today versus predicting the disease four years from now.
    • Result: The reasons were almost identical. The robot didn't suddenly start relying on weird, random factors when looking into the future. It stuck to the core symptoms.
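
All three checks boil down to the same question: do two importance rankings agree? One simple way to quantify that — a sketch of the idea, not necessarily the paper's exact metric — is Spearman rank correlation between, say, the diagnostic model's mean |SHAP| values and the prognostic model's, over the same features. The feature importances below are invented for illustration:

```python
def ranks(values):
    # Rank 1 = most important feature; ties are ignored (fine for a sketch).
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(a, b):
    # Spearman rank correlation: 1.0 = identical ordering, -1.0 = reversed.
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical mean |SHAP| importances for the same five features from a
# diagnostic model ("today") and a 4-year prognostic model ("the future").
diagnosis = [0.42, 0.31, 0.15, 0.08, 0.04]
prognosis = [0.39, 0.28, 0.18, 0.10, 0.05]
print(spearman(diagnosis, prognosis))  # -> 1.0 (identical ordering)
```

The same comparison works for Test A (model feature importances vs. SHAP values) and Test B (early-stage patients vs. late-stage patients): a correlation near 1.0 means the "clues" keep the same pecking order.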

3. The Big Discovery: What Actually Matters?

Through this testing, the researchers confirmed what doctors have suspected for a long time, but now with quantitative evidence:

  • The Real Heroes: The most important clues are Cognitive and Functional skills. Can the patient remember things? Can they pay their bills? Can they navigate a room?
  • The Sidekicks: Genetic markers (like DNA) and administrative details (like which language the test was taken in) played a much smaller role. They were there, but they weren't the main characters driving the decision.
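
The "heroes vs. sidekicks" split can be reproduced by grouping per-feature importances into clinical categories and summing within each group. Everything below — feature names, SHAP magnitudes, category labels — is hypothetical, purely to show the bookkeeping:

```python
from collections import defaultdict

# Hypothetical per-feature mean |SHAP| values from an Alzheimer's model.
importances = {
    "memory_score": 0.42, "pays_bills": 0.31, "judgment": 0.15,
    "apoe4_gene": 0.08, "test_language": 0.04,
}
category = {
    "memory_score": "cognitive", "judgment": "cognitive",
    "pays_bills": "functional",
    "apoe4_gene": "genetic", "test_language": "administrative",
}

totals = defaultdict(float)
for feat, imp in importances.items():
    totals[category[feat]] += imp

# Print categories from most to least important.
for cat, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(cat, round(total, 2))
```

With these made-up numbers, cognitive and functional categories dominate while genetic and administrative features trail far behind — the same "main characters vs. supporting cast" pattern the paper reports.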

4. Why This Matters for You

Think of this framework as a "Trust Seal" for AI in medicine.

Before this paper, a doctor might look at an AI result and think, "It says my patient has Alzheimer's, but I don't know if the AI is just guessing or if it's actually looking at the right symptoms."

Now, because the researchers proved that the AI's reasoning is stable (it doesn't flip-flop), coherent (its explanation matches its logic), and transferable (it works for both today and the future), doctors can finally say:

"I trust this AI. It's looking at the same real-world symptoms I am, and it's consistent. I can use this tool to help my patients."

In a Nutshell

The paper didn't just build a better robot; it built a better translator for the robot. It proved that the robot's "reasoning" isn't magic or luck—it's based on solid, consistent medical facts that doctors can understand and trust. This is a huge step toward getting AI into real hospitals to help fight Alzheimer's.