An Explainable Ensemble Framework for Alzheimer's Disease Prediction Using Structured Clinical and Cognitive Data

Imagine your brain is a bustling city. For a long time, it runs smoothly, with traffic flowing and lights turning green. But Alzheimer's disease is like a slow-acting fog that creeps in, first making it hard to find your way around (memory loss), then causing traffic jams (confusion), and eventually shutting down the whole city (total dependency).

The problem is that this fog is sneaky. By the time you can clearly see it, it's often too late to do much about it. Doctors usually need expensive, invasive tests (like taking a sample of the brain's "water" or doing heavy MRI scans) to spot it, which isn't practical for everyone.

This paper is about building a smart, transparent digital detective that can spot this fog early, using only a simple checklist of questions and basic health stats.

Here is how the researchers built this detective, explained in everyday terms:

1. The Ingredients: The "Health Report Card"

Instead of using complex brain scans, the team used a "report card" filled with 33 everyday facts about a patient. Think of this as a detailed resume for your health. It includes:

Demographics: How old are you? Are you male or female?
Lifestyle: Do you sleep well? Do you exercise? What do you eat?
Brain Tests: How well did you do on a memory test (MMSE)? Can you still dress yourself or manage your money (Functional Assessment)?
Body Stats: Blood pressure, cholesterol, and BMI.

2. The Team of Detectives: The "Ensemble"

The researchers didn't just hire one detective; they hired a team of five different experts (called an "Ensemble").

The Experts: They used five powerful computer algorithms (Random Forest, XGBoost, LightGBM, CatBoost, and Extra Trees). Imagine these as five different doctors who all look at the same patient but use slightly different ways of thinking.
The Strategy: Instead of letting just one doctor make the final call, they let the whole team vote. In the simplest version (majority voting), if at least three out of five say, "This looks like Alzheimer's," the system flags it. This is like asking a panel of judges instead of just one, so that no single opinion decides the outcome.
The Deep Learning "Super-Computer": They also tried a very complex neural network (a type of AI that mimics the human brain), but surprisingly, the team of five "tree-based" experts performed better. It turns out, for this specific job, a well-coordinated team of specialists is better than one super-complex machine.

3. Cleaning the Data: "The Kitchen Prep"

Before the detectives could work, the data had to be prepped.

Fixing the Imbalance: In their dataset, there were far more healthy people than sick people (like having 100 healthy apples and only 50 rotten ones). If the computer just guessed "Healthy" every time, it would be right about two-thirds of the time but useless for finding the sick ones. The researchers used a technique called SMOTE-Tomek, which artificially creates more examples of the "sick" cases and removes confusing borderline cases, so the detectives could learn what to look for properly.
Creating New Clues: They didn't just use the raw numbers; they combined them to create new clues. For example, they multiplied "Age" by "BMI" to see if being older and heavier created a specific risk pattern. It's like realizing that "rain" + "wind" is a bigger problem than just "rain" alone.

4. The "Glass Box": Explainable AI (XAI)

This is the most important part. Usually, AI is a "Black Box"—you put data in, and it spits out an answer, but you have no idea why.

The Problem: If a doctor says, "The computer says you have Alzheimer's," but can't explain why, they won't trust it.
The Solution: The researchers used a tool called SHAP (which stands for SHapley Additive exPlanations). Think of SHAP as a magnifying glass that shows exactly which clues tipped the scales.
The Result: The AI didn't just say "Yes" or "No." It said, "We think this person has Alzheimer's because their memory test score dropped significantly, combined with their age and difficulty in daily tasks." This transparency makes doctors trust the system.

5. The Results: Who Won?

When they tested this system on people it had never seen before:

Accuracy: The team of experts (especially Random Forest and Gradient Boosting) got it right about 86% of the time.
Reliability: They were very good at not crying wolf. When the best model said "Alzheimer's," it was correct about 96% of the time (high precision).
The Winners: The tree-based experts beat the "Super-Computer" (Deep Learning). The best single detective was Random Forest, and a carefully tuned Random Forest on its own even did slightly better than the combined "Team Vote."

The Big Takeaway

This paper suggests that you don't need a million-dollar MRI machine to get a useful early warning for Alzheimer's. By combining simple, everyday health data with a smart team of AI algorithms that can explain their reasoning, we can build a tool that is:

Cheaper: Uses data doctors already have.
Faster: Can screen more people.
Trustworthy: Tells the doctor why it made its prediction.

It's like having a wise, transparent assistant who helps doctors catch the fog before it turns into a storm, giving patients a better chance to manage their lives.

1. Problem Statement

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder affecting millions globally, with early detection being critical for intervention and quality of life management. However, traditional diagnostic methods (neuroimaging, CSF analysis, and cognitive tests) are often expensive, invasive, and impractical for widespread screening, particularly in resource-limited settings. Furthermore, existing Machine Learning (ML) and Deep Learning (DL) models often suffer from:

Lack of Interpretability: "Black box" models hinder clinical trust.
Data Imbalance: Medical datasets often have skewed class distributions.
Over-reliance on Imaging: Many high-performing models depend solely on MRI/CT data rather than accessible structured clinical data.
Data Leakage: Improper splitting strategies in previous studies lead to inflated performance metrics.

The paper addresses the need for a cost-effective, non-invasive, and interpretable AI framework that utilizes structured clinical, lifestyle, and cognitive data to predict AD with high reliability.

2. Methodology

The study proposes a comprehensive end-to-end machine learning pipeline designed to ensure data integrity, robustness, and explainability.

A. Data Acquisition and Preprocessing

Dataset: Utilized an open-source clinical dataset (El Kharoua, Kaggle) containing 2,149 samples and 33 attributes.
Features: Includes demographics (Age, Gender, BMI), cognitive metrics (MMSE, Functional Assessment), lifestyle factors (Sleep, Diet, Activity), and clinical markers (Cholesterol, Blood Pressure).
Class Distribution: The dataset exhibits moderate imbalance (1,389 Non-AD vs. 760 AD).
Leakage Prevention: Implemented a strict two-stage stratified splitting strategy:
1. Split the full dataset into 85% (Temporary) and 15% (Independent Test).
2. Split the Temporary set into Training (70% of the full dataset) and Validation (15% of the full dataset), i.e. roughly 82% and 18% of the Temporary set.
- Crucial: All preprocessing (scaling, engineering, balancing) was fitted only on the training set to prevent data leakage.
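
The two-stage split above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the feature matrix is synthetic, the five placeholder columns and the `random_state` values are assumptions, and only the class counts (1,389 vs. 760) and the split proportions come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in using the class counts reported above (1,389 Non-AD, 760 AD).
y = np.array([0] * 1389 + [1] * 760)
X = rng.normal(size=(len(y), 5))  # 5 placeholder features, not the real 33 attributes

# Stage 1: hold out 15% of all samples as the independent test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# Stage 2: carve the validation set out of the remaining 85%.
# 0.15 / 0.85 of the temporary set equals 15% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, stratify=y_temp, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the other splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
```

Because the scaler never sees validation or test rows during fitting, their statistics cannot influence the training features, which is the leakage the text warns about.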

B. Feature Engineering

A structured pipeline was fitted exclusively on the training set and then applied unchanged to the validation and test sets:

Interaction Features: Created 6 non-linear terms (e.g., $BMI \times Age$, $MMSE \times FunctionalAssessment$).
Polynomial & Ratio Features: Added higher-order terms ($Age^2$, $BMI^2$) and ratios ($MMSE / PhysicalActivity$).
Correlation Reduction: Removed features with pairwise correlation $|r| > 0.95$ to reduce multicollinearity.
Scaling: Applied StandardScaler.
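
A rough sketch of these feature-engineering steps, using NumPy only. The column values are synthetic, the value ranges are illustrative assumptions, and the small constant in the ratio's denominator is an assumption added to avoid division by zero; the helper names `engineer` and `correlated_to_drop` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic placeholder training columns; the ranges are illustrative assumptions.
train = {
    "Age": rng.uniform(60, 90, n),
    "BMI": rng.uniform(15, 40, n),
    "MMSE": rng.uniform(0, 30, n),
    "FunctionalAssessment": rng.uniform(0, 10, n),
    "PhysicalActivity": rng.uniform(0, 10, n),
}

def engineer(cols):
    """Add the interaction, polynomial, and ratio terms named in the text."""
    out = dict(cols)
    out["BMI_x_Age"] = cols["BMI"] * cols["Age"]
    out["MMSE_x_FunctionalAssessment"] = cols["MMSE"] * cols["FunctionalAssessment"]
    out["Age_sq"] = cols["Age"] ** 2
    out["BMI_sq"] = cols["BMI"] ** 2
    # The 1e-6 guard against division by zero is an assumption of this sketch.
    out["MMSE_per_PhysicalActivity"] = cols["MMSE"] / (cols["PhysicalActivity"] + 1e-6)
    return out

def correlated_to_drop(cols, threshold=0.95):
    """Names of later columns whose |r| with an earlier kept column exceeds threshold."""
    names = list(cols)
    matrix = np.column_stack([cols[k] for k in names])
    corr = np.abs(np.corrcoef(matrix, rowvar=False))
    dropped = []
    for j, name in enumerate(names):
        kept_idx = [i for i in range(j) if names[i] not in dropped]
        if any(corr[i, j] > threshold for i in kept_idx):
            dropped.append(name)
    return dropped

engineered = engineer(train)
to_drop = correlated_to_drop(engineered)  # decided on training data only
train_final = {k: v for k, v in engineered.items() if k not in to_drop}
print("dropped:", to_drop)
```

On this synthetic data the squared terms are almost perfectly correlated with their base features over such narrow positive ranges, so the $|r| > 0.95$ filter removes them again. Whether the same happens on the real dataset depends on its value ranges.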

C. Class Imbalance Handling

To address the skewed target variable, the framework employed a SMOTE–Tomek hybrid resampling technique on the training data to balance the classes before model training.
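
To make the two halves of SMOTE–Tomek concrete, here is a simplified from-scratch sketch rather than the library implementation the authors presumably used. SMOTE interpolates between a minority sample and one of its nearest minority neighbours; a Tomek link is a pair of opposite-class samples that are each other's nearest neighbour. The data, the function names, and the choice to remove both members of each link are assumptions of this sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating toward minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
    base = rng.integers(0, len(X_min), n_new)
    partner = neighbours[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))  # position along the segment base -> partner
    return X_min[base] + gap * (X_min[partner] - X_min[base])

def tomek_link_mask(X, y):
    """Boolean mask that is True for samples NOT involved in a Tomek link."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]  # nearest other sample
    mutual = nearest[nearest] == np.arange(len(X))
    in_link = mutual & (y != y[nearest])
    return ~in_link

# Synthetic training data: 300 majority vs. 150 minority samples (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
               rng.normal(1.5, 1.0, size=(150, 2))])
y = np.array([0] * 300 + [1] * 150)

# Step 1: oversample the minority class up to the majority count.
X_syn = smote(X[y == 1], n_new=150)
X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(len(X_syn), dtype=int)])

# Step 2: remove both members of every Tomek link to clean the class boundary.
keep = tomek_link_mask(X_bal, y_bal)
X_res, y_res = X_bal[keep], y_bal[keep]
print(np.bincount(y), "->", np.bincount(y_bal), "->", np.bincount(y_res))
```

As in the text, resampling of this kind belongs on the training split only; the validation and test sets keep their original class distribution.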

D. Model Development

The study evaluated two categories of models:

Deep Learning: A feed-forward Artificial Neural Network (ANN) with layer widths $512 \to 256 \to 128 \to 64$, using ReLU activations, Batch Normalization, Dropout, and L2 regularization.
Ensemble Learning: Five optimized tree-based algorithms:
- Random Forest (RF)
- XGBoost
- LightGBM
- CatBoost
- Extra Trees (ET)

E. Ensemble Strategies

Three meta-strategies were tested on the validation set to improve generalization:

Hard Voting
Soft Voting
Stacking (using XGBoost as the meta-learner)
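
The three strategies can be compared side by side with scikit-learn's built-in meta-estimators. In this sketch the data is synthetic, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, LightGBM, and CatBoost (including as the stacking meta-learner) purely to keep the example self-contained; the hyperparameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, moderately imbalanced binary problem (not the clinical dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, weights=[0.65, 0.35], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

def base_models():
    # Stand-ins: GradientBoostingClassifier replaces XGBoost/LightGBM/CatBoost here.
    return [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ]

strategies = {
    # Hard voting: each model casts one vote for a class label.
    "hard_voting": VotingClassifier(base_models(), voting="hard"),
    # Soft voting: predicted class probabilities are averaged.
    "soft_voting": VotingClassifier(base_models(), voting="soft"),
    # Stacking: a meta-learner is trained on cross-validated base predictions.
    "stacking": StackingClassifier(
        base_models(),
        final_estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
        cv=5,
    ),
}

scores = {}
for name, model in strategies.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_val, model.predict(X_val))
    print(f"{name}: validation accuracy = {scores[name]:.3f}")
```

Selecting among the strategies on the validation set, as described above, leaves the independent test set untouched until the final evaluation.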

F. Explainability (XAI)

To ensure clinical transparency, the study integrated:

Tree-based Feature Importance: Based on the mean decrease in Gini impurity.
Permutation Importance: To measure sensitivity to feature perturbation.
SHAP (SHapley Additive exPlanations): For both global and local (instance-level) interpretability.
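
The first two importance measures can be illustrated with scikit-learn alone; SHAP is omitted here because it needs a separate package. Everything in this sketch is synthetic: the feature names are borrowed from the text for readability, and the labelling rule is a made-up assumption, not a finding of the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 800
names = ["MMSE", "FunctionalAssessment", "Age", "Noise"]
X = np.column_stack([
    rng.uniform(0, 30, n),   # MMSE-like score
    rng.uniform(0, 10, n),   # functional-assessment-like score
    rng.uniform(60, 90, n),  # age-like value
    rng.normal(size=n),      # pure noise
])
# Made-up rule for the demo: the label depends only on the first two columns.
y = ((X[:, 0] < 20) & (X[:, 1] < 5)).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importance: computed from the training process itself.
gini = model.feature_importances_

# Permutation importance: drop in held-out score when one column is shuffled.
perm = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

for name, g, p in zip(names, gini, perm.importances_mean):
    print(f"{name:22s} gini={g:.3f}  permutation={p:.3f}")
```

Impurity-based importance is derived from the training data, while permutation importance is measured on held-out data, so agreement between the two is a useful sanity check before turning to instance-level SHAP explanations.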

3. Key Contributions

Robust Pipeline: Introduction of a rigorous workflow that strictly prevents data leakage through multi-stage stratified splitting and isolated preprocessing.
Hybrid Resampling: Effective application of SMOTE–Tomek to handle class imbalance in structured clinical data.
Comparative Analysis: A systematic comparison of five state-of-the-art ensemble algorithms against a Deep Neural Network, demonstrating that tree-based ensembles outperform DL in this specific context.
Clinical Interpretability: Moving beyond accuracy metrics to provide actionable insights via SHAP, identifying specific clinical drivers (e.g., MMSE, Functional Assessment) that align with medical consensus.

4. Results

The models were evaluated on the unseen independent test set using Accuracy, Precision, Recall, F1-Score, and AUC-ROC.

Performance Comparison:
- Tree-based Ensembles significantly outperformed the Deep Neural Network (ANN).
- Gradient Boosting achieved the highest F1-Score (76.19%) and Precision (96.00%), indicating high reliability in positive predictions with minimal false alarms.
- Random Forest achieved the highest AUC (0.9059) and Accuracy (85.76%).
- Deep Neural Network lagged behind with an Accuracy of 80.19% and AUC of 0.8488.
Ensemble vs. Single Models:
- Counter-intuitively, optimizing a single strong model (Random Forest with the best random seed) yielded the highest accuracy (86.38%), slightly outperforming the more complex voting and stacking ensembles. This suggests that, for this specific dataset, carefully tuning a single robust model can be at least as effective as aggregation.
Error Analysis:
- Tree-based models produced far fewer False Positives (only 3 each for RF and Gradient Boosting) than the ANN (20 false alarms), making them safer for clinical decision support.
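
As a worked check of how these metrics relate, the sketch below recomputes precision and F1 from confusion-matrix counts. Only the 3 false positives come from the text; the true-positive and false-negative counts are assumptions chosen so that the results match the reported Gradient Boosting figures (96.00% precision, 76.19% F1), and the paper's actual confusion matrix may differ.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions for the positive (AD) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# fp = 3 is stated in the text; tp and fn are illustrative assumptions chosen
# to be consistent with the reported precision (96.00%) and F1 (76.19%).
tp, fp, fn = 72, 3, 42

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}")
```

Under these assumed counts, recall would be roughly 63%, which illustrates why very high precision can coexist with a more modest F1: few false alarms, but a share of true cases is missed.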

Key Predictive Features (via SHAP & Importance Analysis)

The XAI analysis identified the most influential determinants for AD prediction:

Cognitive Assessments: MMSE (Mini-Mental State Examination) and Functional Assessment.
Demographics: Age and Gender (specifically their interaction).
Lifestyle/Metabolic: Physical Activity, ADL (Activities of Daily Living), and engineered interaction features.

5. Significance and Conclusion

This research demonstrates that structured clinical and cognitive data, when processed through an explainable ensemble framework, can reach roughly 86% accuracy for Alzheimer's Disease prediction on this dataset without relying on expensive neuroimaging.

Clinical Utility: The high precision and low false-positive rate of the Random Forest and Gradient Boosting models make them suitable for clinical decision support systems, helping clinicians prioritize patients for further invasive testing.
Trustworthiness: By integrating SHAP and feature importance, the model provides transparent reasoning for its predictions, fostering trust among medical professionals.
Future Directions: The authors suggest extending the framework to multi-stage disease classification, incorporating longitudinal data to track progression, and integrating multimodal data (MRI, EEG) to further enhance early detection capabilities.

In summary, the paper reports that, on this dataset, Explainable AI (XAI) combined with optimized tree-based ensembles offers a more accurate, transparent, and reliable alternative to deep learning for AD prediction using routine clinical data.