MedAdhereAI: An Interpretable Machine Learning Pipeline… — Plain-Language Explanation

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Managing chronic diseases like diabetes and hypertension requires patients to take medication consistently. However, many people struggle to follow their prescribed treatments. This non-adherence leads to more hospitalizations and contributes to hundreds of billions of dollars in annual healthcare costs. In many parts of the world, healthcare systems lack the resources to monitor every patient closely, making it difficult to identify who might stop taking their medicine before a health crisis occurs.

The researchers in this paper developed a system called MedAdhereAI to address this problem. Instead of relying on complex medical data like blood tests or expensive imaging, the system uses information that is already routinely collected: pharmacy refill records and insurance claims. By looking at how often a patient visits a doctor and the gaps between their medication refills, the system attempts to predict whether a patient is at risk of not taking their medicine.

To build this system, the researchers used a dataset of anonymized records for patients with diabetes and hypertension. They focused on specific patterns, such as the number of days between refills and the total number of healthcare visits. They tested two different types of mathematical models to see which could best identify at-risk patients. One model, called logistic regression, focused on straightforward relationships between data points, while the other, a random forest, looked for more complex, overlapping patterns.

The results showed that the logistic regression model was more effective at this specific task. It achieved a score of 0.82 on a standard measurement of how well a model distinguishes between two groups (the ROC AUC, where 0.5 is no better than chance and 1.0 is perfect). Its probability estimates were also reasonably well calibrated, with a Brier score of 0.1749 (lower is better). The researchers found that the most important factors in predicting whether someone would stop taking their medication were the total number of doctor visits, the patient's age, and the length of the gaps between medication refills.

A central goal of the research was to ensure the system was not a "black box"—a system that provides an answer without explaining how it reached that conclusion. Clinicians are often hesitant to trust automated tools if they cannot see the reasoning behind a prediction. To solve this, the researchers integrated a method that provides explanations for both the entire group and for individual patients. For a single patient, the system can show exactly which factors, such as a long gap since their last refill, pushed the prediction toward a high risk of non-adherence.

The authors suggest that MedAdhereAI could serve as a decision-support tool. By identifying high-risk patients using only basic, widely available data, healthcare providers in resource-limited settings might be able to direct their limited time and resources toward the people who need the most help staying on their treatment plans. The researchers note that while the results are promising, future work is needed to test the system on different populations and to see if adding more detailed information, such as social or clinical data, improves its accuracy.

Technical Summary: MedAdhereAI

Title: MedAdhereAI: An Interpretable Machine Learning Pipeline for Predicting Medication Non-Adherence in Chronic Disease Patients Using Real-World Refill Data
Authors: Subash Yadav and Saijal Rajbhandari

1. Problem Statement

Medication non-adherence in chronic disease management (specifically diabetes and hypertension) is a global health crisis. It contributes to increased morbidity, higher hospitalization rates, and an estimated $300 billion in annual preventable healthcare costs in the U.S. alone.

The authors identify two critical gaps in existing predictive modeling solutions:

Data Accessibility: Many current models require complex, high-fidelity data (e.g., detailed clinical records, lab results, or imaging) that are often unavailable in resource-limited or fragmented healthcare settings.
The "Black-Box" Problem: Many high-performing machine learning models lack interpretability, making clinicians hesitant to trust or integrate their predictions into actual care workflows.

2. Methodology

The study proposes MedAdhereAI, a modular machine learning pipeline designed to work with minimal, routinely collected real-world data (refill and claims data).

Data Source & Preprocessing: The researchers used a publicly available dataset from Mendeley Data containing anonymized refill records for patients with diabetes and hypertension. They defined a binary target variable (ADHERENT_BINARY), where patients with $\ge 8$ refills were classified as adherent.
Feature Engineering: The pipeline focuses on temporal and demographic features, including:
- Temporal/Behavioral: Average and maximum refill gaps (avg_refill_gap, max_refill_gap), and total healthcare visits (total_visits).
- Demographic: Age and gender.
Model Selection: To balance predictive power with clinical transparency, two models were implemented using scikit-learn:
1. Logistic Regression: Chosen for its coefficient-based interpretability and familiarity to clinicians.
2. Random Forest: Chosen to capture non-linear interactions between features.
Explainability Framework: The study integrated SHAP (SHapley Additive exPlanations) to provide both global interpretability (identifying which features drive the model overall) and local interpretability (explaining why a specific individual patient was flagged as high-risk).
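
The labeling rule and the refill-gap features described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the input layout of (patient_id, refill_date) rows, the helper name build_features, the n_refills field, and the sample dates are all hypothetical. Only avg_refill_gap, max_refill_gap, ADHERENT_BINARY, and the threshold of 8 refills come from the summary. total_visits would be derived from claims data and is left out here.

```python
from collections import defaultdict
from datetime import date

ADHERENCE_THRESHOLD = 8  # from the summary: >= 8 refills counts as adherent


def build_features(refills):
    """Turn (patient_id, refill_date) rows into one feature dict per patient.

    The input layout is hypothetical; the real dataset's columns may differ.
    """
    dates_by_patient = defaultdict(list)
    for patient_id, refill_date in refills:
        dates_by_patient[patient_id].append(refill_date)

    features = {}
    for patient_id, dates in dates_by_patient.items():
        dates.sort()
        # Gaps in days between consecutive refills.
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        features[patient_id] = {
            "n_refills": len(dates),
            "avg_refill_gap": sum(gaps) / len(gaps) if gaps else 0.0,
            "max_refill_gap": max(gaps) if gaps else 0,
            "ADHERENT_BINARY": int(len(dates) >= ADHERENCE_THRESHOLD),
        }
    return features


# Made-up sample rows for illustration only.
sample = [
    ("p1", date(2023, 1, 1)),
    ("p1", date(2023, 1, 31)),
    ("p1", date(2023, 3, 17)),
]
result = build_features(sample)
```

With only three refills, the sample patient falls below the threshold and is labeled non-adherent; the two gaps (30 and 45 days) feed the average and maximum gap features.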

3. Key Contributions

Minimalist Data Approach: Shows that useful predictive performance (ROC AUC of 0.82 for logistic regression) can be achieved using only widely available refill and claims data, without expensive clinical datasets.
Interpretability-First Design: By combining traditional models (Logistic Regression) with SHAP, the pipeline provides actionable "why" explanations for clinicians.
Reproducible Pipeline: The authors provided a modular, scalable, and open-source pipeline (available on GitHub) designed for low computational overhead, making it suitable for deployment in resource-constrained environments.

4. Results

The Logistic Regression model outperformed the Random Forest model on every metric reported for both models:

Metric	Logistic Regression	Random Forest
ROC AUC	0.82	0.77
Brier Score	0.1749 (Better calibration)	N/A
Accuracy	0.70	0.65
Precision	0.76	0.70
Recall	0.74	0.66
F1-Score	0.75	0.68
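
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, and the Brier score is the mean squared difference between a predicted probability and the 0/1 outcome. The sketch below recomputes F1 from the reported precision and recall. The Brier example uses made-up probabilities and outcomes, not the study's predictions, and the function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Precision and recall as reported in the table above.
f1_logreg = f1_score(0.76, 0.74)
f1_forest = f1_score(0.70, 0.66)

# Hypothetical predictions for four patients (not data from the study).
toy_brier = brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0])
```

Rounded to two decimals, the recomputed F1 values are 0.75 and 0.68, which agrees with the table. A Brier score of 0 would mean perfect probability estimates, and always predicting 0.5 would score 0.25.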

Feature Importance: SHAP analysis identified total_visits, AGE, and avg_refill_gap as the most significant predictors of non-adherence.

5. Significance and Clinical Impact

MedAdhereAI is proposed as a clinical decision-support tool. Its potential significance lies in its ability to:

Enable Early Intervention: By identifying high-risk patients before they miss critical doses, healthcare providers can deploy targeted interventions (e.g., patient counseling or reminders).
Optimize Resource Allocation: In settings with limited staff and funding, the tool helps prioritize patients who need the most attention.
Foster Clinician Trust: Through SHAP force plots, the model moves from a "black box" to a transparent assistant, showing clinicians exactly which patient behaviors (like increasing refill gaps) are driving the risk score.
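
To make the idea of a per-patient explanation concrete, here is a minimal sketch of how SHAP values work out for a linear model such as logistic regression. When features are treated as independent, each feature's contribution in log-odds units is its coefficient times the difference between the patient's value and the background average, and the contributions add up to the gap between the patient's score and the baseline. Everything numeric below is hypothetical: the coefficients, intercept, background means, and patient values are invented for illustration, and the sketch assumes the model scores the risk of non-adherence. It does not use the SHAP library or reproduce the authors' fitted model.

```python
import math

# Hypothetical coefficients and background means; NOT values from the paper.
FEATURES = ["total_visits", "AGE", "avg_refill_gap"]
WEIGHTS = {"total_visits": -0.30, "AGE": -0.02, "avg_refill_gap": 0.05}
INTERCEPT = 0.5
BACKGROUND_MEAN = {"total_visits": 6.0, "AGE": 55.0, "avg_refill_gap": 35.0}


def log_odds(x):
    """Linear score of the (assumed) non-adherence model."""
    return INTERCEPT + sum(WEIGHTS[f] * x[f] for f in FEATURES)


def linear_shap(x):
    """Per-feature contributions w_i * (x_i - mean_i), in log-odds units."""
    return {f: WEIGHTS[f] * (x[f] - BACKGROUND_MEAN[f]) for f in FEATURES}


def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# A made-up patient with few visits and long refill gaps.
patient = {"total_visits": 2.0, "AGE": 60.0, "avg_refill_gap": 60.0}
contributions = linear_shap(patient)
base_value = log_odds(BACKGROUND_MEAN)
risk = sigmoid(log_odds(patient))

# Rank features by how strongly they push this patient's score.
ranked = sorted(contributions, key=lambda f: abs(contributions[f]), reverse=True)
```

In this toy case the long average refill gap and the low visit count both push the score up, while age pulls it slightly down. A force plot displays the same information graphically: the baseline, each feature's push, and the resulting score.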
