Detecting Structural Heart Disease from Electrocardiograms via a Generalized Additive Model of Interpretable Foundation-Model Predictors

Imagine your heart is a complex orchestra. Sometimes, the musicians (the heart valves, muscles, and chambers) get out of tune or develop structural problems, like a valve that doesn't close tightly or a chamber that gets too big. This is called Structural Heart Disease (SHD).

The problem is that these "out-of-tune" sections often play so quietly that the conductor (the doctor) can't hear them just by listening to the patient. The only way to be sure is to use a high-tech camera called an Echocardiogram (ECHO), which takes a video of the heart. But ECHOs are expensive, require special experts, and aren't available everywhere.

Enter the Electrocardiogram (ECG). This is the cheap, quick, and widely available test in which electrodes are stuck to your chest to record the heart's electrical rhythm. It's like reading the orchestra's sheet music. The problem? The signs of structural heart disease in that sheet music are so subtle that human eyes can't pick out the hidden notes.

The Old Way: The "Black Box" Wizard

Recently, scientists tried to use Artificial Intelligence (AI) to read these hidden notes. They built massive, complex AI models (like deep neural networks) that could look at the ECG and guess if a patient had heart disease.

These AI "wizards" were incredibly accurate. But they were black boxes.

The Analogy: Imagine a wizard who tells you, "This person has a heart problem," but when you ask, "Why?" the wizard just shrugs and says, "The magic numbers said so."
The Problem: Doctors are skeptical. If they don't understand why the AI made a decision, they can't trust it, and they won't use it in real hospitals.

The New Way: The "Translator" Framework

The authors of this paper came up with a clever solution. They didn't throw away the powerful AI; they just added a translator to make it speak human.

Here is how their new system works, step-by-step:

1. The "Expert Translator" (The Foundation Model)

First, they used a super-smart AI that has already learned to read standard heart conditions (like "Irregular heartbeat" or "Fast heart rate"). Think of this AI as a master translator who is fluent in the language of heart rhythms.

Instead of asking the AI to guess the final answer directly, they ask it: "What is the probability that this patient has an irregular heartbeat? What about a fast heart rate? What about a thick heart muscle?"
The AI gives 71 different "risk scores" for these standard conditions. These scores are like clues.

2. The "Detective Board" (The Generalized Additive Model)

Now, instead of letting a black box guess the final answer, the researchers put these 71 clues onto a detective board.

They use a statistical method called a Generalized Additive Model (GAM).
The Analogy: Imagine a detective looking at a board with strings connecting clues to a suspect. The detective can see exactly how much "Irregular Heartbeat" contributes to the suspicion, and how much "Fast Heart Rate" adds to it.
Crucially, this system allows for non-linear relationships. It's not just "More clues = More danger." It might be "A little bit of this clue is fine, but if it gets too high, the danger skyrockets." The system maps out these curves so doctors can see the shape of the danger.

Why This is a Game Changer

1. It's Transparent (No More Black Boxes)
Because the system uses the 71 standard clues as inputs, a doctor can look at the result and say, "Ah, the AI is worried because the patient has a high risk of Left Ventricular Hypertrophy (a thick heart muscle), and when that risk gets above 0.6, the chance of structural disease jumps."

Metaphor: Instead of a magic spell, it's like a recipe. "We added 2 cups of flour (Clue A) and 1 cup of sugar (Clue B), and that's why the cake (the diagnosis) turned out this way."

2. It's Smarter with Less Data
Usually, AI needs to eat massive amounts of data to learn. This new method is like a student who learns from a great teacher (the Foundation Model) and then only needs a little bit of extra practice to master the specific test.

The Result: Trained on all the data, the new method scored slightly higher than the best existing AI. Trained on only 30% of the data, it still matched or slightly exceeded that AI trained on everything. It's like a student earning the same top grade after studying for 3 hours, while the other students needed 10 hours.

3. It Works for Everyone
The researchers tested this on different groups of people (different ages, sexes, and races/ethnicities) and in different clinical settings. The system performed consistently across these groups, which suggests it is not hiding a strong bias against any one of them.

The Bottom Line

This paper proposes a bridge between old-school statistics (which are clear and explainable) and modern AI (which is powerful but mysterious).

By using AI to generate clear, understandable clues, and then using statistics to connect those clues to the final diagnosis, they created a tool that is:

Accurate: It detects structural heart disease slightly better than the best current method.
Efficient: It needs far less training data to reach the same accuracy.
Trustworthy: Doctors can see exactly why it made a decision.

In the future, this could mean that a simple, cheap ECG test in a rural clinic could reliably flag patients who need a more expensive heart scan, saving lives and money, all while keeping the doctor in the loop.

1. Problem Statement

Structural Heart Disease (SHD) is a prevalent global health challenge with many undiagnosed cases. Early detection is critical but currently limited by the high cost and specialized expertise required for Echocardiography (ECHO), the diagnostic gold standard.

The Opportunity: Electrocardiograms (ECGs) are low-cost and widely available. Recent AI research suggests ECGs contain subtle patterns predictive of SHD.
The Challenge: Existing AI solutions for ECG-based SHD detection rely on end-to-end deep learning models (e.g., CNNs, Transformers). These are "black-box" systems that lack interpretability. Clinicians cannot easily understand why a model flags a patient as high-risk, hindering trust and clinical adoption. Furthermore, SHD patterns are often too subtle for human visual inspection, making traditional rule-based interpretation insufficient.

2. Methodology

The authors propose a hybrid modeling framework that combines the predictive power of deep learning foundation models with the interpretability of classical statistical modeling.

A. Core Architecture

The model uses a Generalized Additive Model (GAM) formulation:
$g\{E(y | z, X)\} = \gamma^\top z + \sum_{j=1}^{J} f_j[\sigma\{h_j(X)\}]$
Where:

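$y$ : Binary response (presence of SHD).
$g(\cdot)$ : Link function; the logit link here, since the model is fit by logistic regression.
$z$ : Clinical covariates (age, sex, heart rates, intervals), with linear coefficients $\gamma$.
$X$ : Raw ECG waveform.
$J$ : Number of foundation-model predictors ($J = 71$, one per traditional ECG diagnosis).
$h_j(X)$ : The $j$-th latent predictor (a logit) extracted from a pre-trained ECG Foundation Model.
$\sigma(\cdot)$ : Sigmoid function, converting logits to probabilities (risks) of specific traditional ECG diagnoses.
$f_j(\cdot)$ : Unknown smooth, non-parametric functions (estimated via B-splines) that map each predictor's risk to its additive contribution to the SHD outcome on the scale of $g$.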
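
To make the formula concrete, here is a minimal Python sketch of how a single prediction could be assembled, assuming a logit link (consistent with the logistic regression fit described in this summary). The function names, the two shape functions, and every numeric value are hypothetical placeholders for illustration, not estimates from the paper, and only two of the predictors are shown.

```python
import math

def sigmoid(t):
    """Logistic function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def shd_probability(z, gamma, logits, shape_fns):
    """Evaluate g^{-1}(gamma^T z + sum_j f_j(sigmoid(h_j(X)))) with g = logit.

    z         : clinical covariates
    gamma     : one linear coefficient per covariate
    logits    : foundation-model outputs h_j(X), one per ECG diagnosis
    shape_fns : one callable f_j per diagnosis, mapping a risk in [0, 1]
                to an additive contribution on the log-odds scale
    """
    linear_part = sum(g * v for g, v in zip(gamma, z))
    risks = [sigmoid(h) for h in logits]
    additive_part = sum(f(p) for f, p in zip(shape_fns, risks))
    return sigmoid(linear_part + additive_part)

# Hypothetical shape functions (illustrative only, not fitted curves).
def f_lvh(p):
    # Flat at low risk, rising once the risk passes 0.6.
    return 3.0 * max(0.0, p - 0.6)

def f_afib(p):
    # Roughly linear contribution.
    return 1.5 * p

z = [0.5, 1.0]        # hypothetical covariates, e.g. standardized age and a sex indicator
gamma = [0.4, -0.2]   # hypothetical linear coefficients
logits = [2.0, -1.0]  # hypothetical h_j(X) for the two diagnoses
prob = shd_probability(z, gamma, logits, [f_lvh, f_afib])
print(f"Predicted SHD probability: {prob:.3f}")
```

Because each term enters the sum separately, the contribution of any single diagnosis can be read off by evaluating its `f_j` alone, which is what makes the additive form inspectable.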

B. Key Components

Predictor Extraction (Foundation Model):
- The authors utilize ST-MEM, a Transformer-based ECG foundation model.
- Instead of training an end-to-end classifier, they use the foundation model as a feature extractor.
- They apply a post-training strategy (linear probing followed by regularization with stochastic depth and dropout) on the PTB-XL dataset (21,837 ECGs) to fine-tune the model for 71 traditional ECG diagnostic labels (e.g., Atrial Fibrillation, Left Ventricular Hypertrophy).
- The output of this stage is a vector of 71 calibrated probabilities representing the risk of specific, clinically recognized ECG abnormalities.
Additive Modeling (Interpretability):
- These 71 probabilities serve as inputs to the GAM.
- The relationship between each predictor and SHD risk is modeled using B-spline bases of order $\zeta=4$ (i.e., cubic splines). This allows the model to capture non-linear associations (e.g., a risk might only increase sharply after a certain probability threshold) while remaining transparent.
- The model is trained using penalized logistic regression with $\ell_2$ regularization.
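
The spline representation can be sketched in a few lines of Python. The block below evaluates an order-4 B-spline basis with the Cox-de Boor recursion, forms a shape function as a weighted sum of basis functions, and computes an $\ell_2$ penalty on the weights. The knot placement, the coefficient values, and the penalty weight `lam` are assumptions made for illustration; the summary specifies only the spline order and the use of $\ell_2$ regularization.

```python
def bspline_basis(x, knots, order=4):
    """Evaluate every B-spline basis function of the given order at x
    (Cox-de Boor recursion). Order 4 corresponds to cubic splines."""
    # Order 1: indicator function of each half-open knot interval.
    n_intervals = len(knots) - 1
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(n_intervals)]
    # Close the right boundary so x == knots[-1] falls in the last real interval.
    if x == knots[-1]:
        for i in reversed(range(n_intervals)):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    # Raise the order one step at a time.
    for k in range(2, order + 1):
        raised = []
        for i in range(len(knots) - k):
            left_den = knots[i + k - 1] - knots[i]
            right_den = knots[i + k] - knots[i + 1]
            # Terms with a zero-width support are defined to be zero.
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + k] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            raised.append(left + right)
        basis = raised
    return basis

def shape_function(p, beta, knots, order=4):
    """f_j(p) = sum_k beta_k * B_k(p) for a risk p in [0, 1]."""
    return sum(b * B for b, B in zip(beta, bspline_basis(p, knots, order)))

def l2_penalty(beta, lam):
    """Ridge-style penalty lam * ||beta||^2 added to the logistic loss."""
    return lam * sum(b * b for b in beta)

# Hypothetical clamped knot vector on [0, 1] with three interior knots.
knots = [0.0] * 4 + [0.25, 0.5, 0.75] + [1.0] * 4
beta = [0.0, 0.0, 0.1, 0.3, 0.9, 1.6, 2.0]  # hypothetical spline weights

for p in (0.1, 0.5, 0.9):
    print(f"f({p}) = {shape_function(p, beta, knots):.3f}")
```

With this clamped knot vector there are seven basis functions (three interior knots plus the order), and they sum to one everywhere on [0, 1], so a constant weight vector reproduces a constant function. In a full fit the weights for all predictors would be estimated jointly by minimizing the logistic loss plus the penalty.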

3. Key Contributions

Interpretable Framework: The paper introduces a novel paradigm that bridges deep learning and statistical modeling. It transforms "black-box" foundation model outputs into clinically meaningful, interpretable predictors (risks of known ECG diagnoses) and models their effects transparently.
Superior Performance & Data Efficiency: The method outperforms the current state-of-the-art (Columbia mini model) when trained on the full dataset, and matches or slightly exceeds it when trained on only 30% of the data.
Physiological Insights: By estimating entry-wise functions ($f_j$), the model reveals non-linear relationships between traditional ECG diagnoses and SHD, offering new clinical insights into how subtle electrical patterns correlate with structural disease.

4. Experimental Results

The method was evaluated on the EchoNext benchmark, containing 82,543 paired ECG-ECHO records from 36,286 patients.

Performance Metrics:
- AUROC: 82.8% (vs. 82.0% for the Columbia mini model).
- AUPRC: 79.7% (vs. 78.9%).
- F1 Score: 71.8% (vs. 70.8%).
- Result: The proposed model achieved absolute gains of 0.8, 0.8, and 1.0 percentage points, i.e., relative improvements of +0.98% (AUROC), +1.01% (AUPRC), and +1.41% (F1), over the best deep learning baseline.
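
The relative improvements follow directly from the reported scores. The short Python check below recomputes them; the helper name `relative_improvement` is introduced here for illustration only.

```python
# Reported scores in percent: (proposed model, Columbia mini model).
scores = {
    "AUROC": (82.8, 82.0),
    "AUPRC": (79.7, 78.9),
    "F1": (71.8, 70.8),
}

def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, expressed in percent."""
    return 100.0 * (new - baseline) / baseline

for name, (proposed, baseline) in scores.items():
    gain = relative_improvement(proposed, baseline)
    print(f"{name}: +{proposed - baseline:.1f} points absolute, +{gain:.2f}% relative")

# AUROC: +0.8 points absolute, +0.98% relative
# AUPRC: +0.8 points absolute, +1.01% relative
# F1: +1.0 points absolute, +1.41% relative
```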
Data Efficiency:
- When trained on only 30% of the available data, the proposed additive model matched or slightly exceeded the performance of the Columbia mini model trained on 100% of the data. This indicates superior sample efficiency.
Subgroup Analysis:
- The model performed consistently across heterogeneous subgroups defined by age, sex, race/ethnicity, and clinical context (e.g., emergency vs. outpatient). No subgroup showed a marked drop in performance, and the model remained stable in settings where other models might degrade.
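
A subgroup analysis of this kind can be sketched as a per-group AUROC computation. The Python block below is a generic illustration, not the authors' evaluation code: it computes AUROC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case, then repeats that within each subgroup. The labels, scores, and group names are toy values invented for the example.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as one half. Quadratic in size, which is fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUROC is undefined unless both classes are present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auroc_by_group(labels, scores, groups):
    """Compute AUROC separately within each subgroup."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, v in enumerate(groups) if v == g]
        result[g] = auroc([labels[i] for i in idx], [scores[i] for i in idx])
    return result

# Toy data (hypothetical): SHD labels, model scores, and an age-band subgroup.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1, 0.4, 0.3]
groups = ["<60", "<60", "<60", "<60", "60+", "60+", "60+", "60+"]

per_group = auroc_by_group(labels, scores, groups)
for name, value in per_group.items():
    print(name, value)
```

Comparing these per-group values, ideally with confidence intervals, is one way to check whether performance is stable across subgroups.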
Interpretability Findings:
- Visualizing the estimated functions (e.g., for Inferior Myocardial Infarction or Left Ventricular Hypertrophy) showed that the risk of SHD increases non-linearly with the probability of these traditional ECG diagnoses. This suggests that established ECG criteria carry latent diagnostic information about SHD that a purely linear treatment of those criteria would miss.

5. Significance and Impact

Clinical Adoption: By providing a transparent decision-making process where clinicians can see which ECG patterns (e.g., "risk of Atrial Fibrillation") are driving the SHD prediction, the model addresses the "black-box" barrier to AI adoption in cardiology.
Paradigm Shift: The work demonstrates that interpretability and high predictive performance are not mutually exclusive. It suggests a future where foundation models act as "feature engines" for classical statistical models, combining the best of both worlds.
Scalable Screening: The approach offers a pathway to scalable, low-cost, early screening for Structural Heart Disease using standard ECGs, potentially reducing the burden on echocardiography resources and catching undiagnosed cases earlier.

Limitations: The study relies on a single-center dataset (Columbia) and a composite SHD label. Future work should include multi-center validation and analysis of specific SHD subtypes.