ECG Classification on PTB-XL: A Data-Centric Approach with Simplified CNN-VAE

Imagine you are a doctor trying to listen to a patient's heart. The heart speaks a secret language called an ECG (Electrocardiogram), which looks like a squiggly line on a graph. For years, doctors have had to stare at these lines for hours to spot problems like heart attacks or weak heart muscles. It's tiring, and sometimes two doctors might disagree on what they see.

This paper is about building a smart robot assistant that can read these heart lines for us. But instead of making the robot a giant, super-complex supercomputer (which is expensive and hard to fit in a hospital), the authors decided to make a small, efficient, and very well-trained robot.

Here is the story of how they did it, broken down into simple parts:

1. The Problem: A Noisy Classroom

The researchers used a massive library of heart recordings called PTB-XL. Think of this library as a classroom with nearly 22,000 students (heart recordings).

The Imbalance: The problem was that the classroom was very unbalanced. There were tons of "Normal" students (healthy hearts) and very few students with specific conditions like "Hypertrophy" (a thickened heart muscle).
The Result: If you just let the robot learn naturally, it would become an expert at spotting "Normal" hearts but would be terrible at spotting the rare, sick ones because it barely saw them. It's like a teacher who only ever sees students with blue shirts; they won't know what to do when a student in a red shirt walks in.

2. The Solution: The "Data-First" Approach

Most scientists try to fix this by building a bigger, more complex robot brain (a fancy AI model). These authors said, "Wait a minute. Let's fix the classroom first."

They adopted a Data-Centric Approach. Instead of making the robot smarter, they made the data better:

Cleaning the Data: They cleaned up the heart signals, making sure every "lead" (the different wires on the heart) was on the same scale, like tuning all the instruments in an orchestra before a concert.
Balancing the Class: They used a clever trick called Oversampling and Undersampling.
- They took the rare "Hypertrophy" students and made photocopies of them (oversampling) so the robot saw them more often.
- They took some of the "Normal" students and asked them to sit out for a bit (undersampling) so the robot didn't get bored with them.
- Analogy: Imagine a teacher who wants to teach a student about rare animals. Instead of just showing one picture of a tiger among 100 pictures of cats, the teacher creates a special book with 50 pictures of tigers and 50 of cats. The student learns much faster.

3. The Robot: A Simple but Smart Detective

They built a model called a CNN-VAE.

CNN (Convolutional Neural Network): Think of this as a magnifying glass that scans the heart line, looking for specific shapes (like the "P-wave" or "QRS complex") just like a detective looking for footprints.
VAE (Variational Autoencoder): This is like a summarizer. It takes the long, complicated heart line and compresses it into a tiny, essential "summary" of what's important. This helps the robot ignore the noise and focus on the signal.
The Size: The best part? This robot is tiny. It has only 197,000 parameters (think of these as the robot's "brain cells"). Compare that to other models that have millions of brain cells. This one is small enough to fit on a smartphone or a portable medical device!

4. The Results: How Did It Do?

The robot was tested on new heart recordings it had never seen before.

Overall Score: It got 87% accuracy, counted one label at a time. That means that out of every 10 yes-or-no calls it made about a heart condition, it got almost 9 right.
The Good News: It became an expert at spotting Normal hearts, correctly recognizing 91% of them. This is huge because it can quickly rule out healthy patients, saving doctors time.
The Challenge: It still struggled a bit with Hypertrophy (the thickened heart). It only caught about half of those cases.
- Why? Hypertrophy is like a whisper in a noisy room. The changes in the heart line are very subtle, and even with all the photocopying tricks, the robot sometimes missed them.

5. Why This Matters

The main lesson of this paper is: You don't always need a bigger, more complex engine to win the race; sometimes you just need better fuel.

Efficiency: Because the model is small, it can run on cheap devices, making heart screening possible in remote villages or small clinics that don't have supercomputers.
Reliability: By focusing on cleaning the data and balancing the classes, they got results that compete with much larger, more expensive models.
Future: The authors admit they need to get better at spotting the "whispers" (Hypertrophy), but they've proven that a simple, well-prepared approach is a powerful tool for saving lives.

In a nutshell: They took a messy pile of heart data, organized it perfectly, and taught a small, efficient robot to read it. The result is a fast, affordable tool that can help doctors spot heart problems earlier, proving that sometimes the simplest solution is the best one.

1. Problem Statement

Cardiovascular diseases are the leading cause of global mortality, making early diagnosis via Electrocardiogram (ECG) critical. While automated ECG classification using deep learning has advanced, current state-of-the-art approaches face three main limitations:

Architectural Complexity: Many models rely on massive, complex architectures (e.g., Transformers, deep ResNets) with millions of parameters, making them difficult to deploy in resource-constrained clinical settings.
Data Neglect: These models often prioritize architectural novelty over data quality, neglecting essential preprocessing and class balancing.
Class Imbalance: Medical datasets like PTB-XL suffer from severe class imbalance (e.g., Normal cases vastly outnumber Hypertrophy cases), leading to poor detection of minority classes.

The authors argue that a data-centric approach—focusing on data quality, preprocessing, and balancing—can achieve competitive results with significantly reduced model complexity.

2. Methodology

Dataset

The study utilizes the PTB-XL dataset, containing 21,837 12-lead ECG recordings from 18,885 patients.

Task: Multi-label classification into 5 diagnostic superclasses:
1. CD: Conduction disturbances
2. HYP: Hypertrophy
3. MI: Myocardial infarction
4. NORM: Normal ECG
5. STTC: ST/T-wave changes (ischemia)
Imbalance: The dataset is highly imbalanced (NORM: ~43.7%, HYP: ~12.2%).
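
Because the task is multi-label, each recording can carry more than one superclass at once. A minimal sketch of the usual multi-hot target encoding is shown here; the alphabetical class order and the helper name `encode_labels` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# The five diagnostic superclasses listed above, in a fixed (alphabetical) order.
SUPERCLASSES = ["CD", "HYP", "MI", "NORM", "STTC"]


def encode_labels(record_labels):
    """Turn a list of superclass names into a 5-element multi-hot vector."""
    vec = np.zeros(len(SUPERCLASSES), dtype=np.float32)
    for name in record_labels:
        vec[SUPERCLASSES.index(name)] = 1.0
    return vec


# Hypothetical records: one healthy, one with two co-occurring findings.
print(encode_labels(["NORM"]))        # [0. 0. 0. 1. 0.]
print(encode_labels(["MI", "STTC"]))  # [0. 0. 1. 0. 1.]
```

A target of this shape pairs naturally with a five-unit sigmoid output, where each unit answers an independent yes-or-no question.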

Data Preprocessing Pipeline

The authors implemented a rigorous three-stage preprocessing strategy:

Stratified Splitting: The data was split into training (Folds 1–9) and testing (Fold 10) sets while maintaining similar class distributions.
Targeted Sampling (Balancing): To address imbalance, the authors applied:
- Oversampling: Minority class HYP was increased from 2,392 to 4,000 samples (+67.2%).
- Undersampling: Majority class NORM was reduced from 8,564 to 4,000 samples (-53.3%).
- Other classes (CD, MI, STTC) were retained within reasonable ranges.
- Result: A balanced training set of 22,069 samples.
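Lead-wise Normalization: Each of the 12 ECG leads was independently normalized using z-score normalization ($x' = \frac{x - \mu}{\sigma + \epsilon}$, where $\mu$ and $\sigma$ are that lead's mean and standard deviation and $\epsilon$ is a small constant that prevents division by zero), with statistics computed from the training set only. This accounts for varying amplitude ranges across different leads and keeps test-set information out of training.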
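
The balancing and normalization stages can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the helper names, the random seed, the value of `eps`, and the synthetic array shapes are all hypothetical, and the resampling treats each class's record indices independently, ignoring the fact that a multi-label record can belong to several classes.

```python
import numpy as np


def resample_class(indices, target, rng):
    """Return `target` indices: a random subset if the class is too large,
    or the originals plus random duplicates if it is too small."""
    indices = np.asarray(indices)
    if len(indices) >= target:
        return rng.choice(indices, size=target, replace=False)  # undersample
    extra = rng.choice(indices, size=target - len(indices), replace=True)
    return np.concatenate([indices, extra])                     # oversample


def fit_lead_stats(x_train):
    """Per-lead mean and std over all records and time steps.
    x_train has shape (records, time_steps, leads)."""
    return x_train.mean(axis=(0, 1)), x_train.std(axis=(0, 1))


def normalize(x, mu, sigma, eps=1e-8):
    """Apply x' = (x - mu) / (sigma + eps) lead by lead."""
    return (x - mu) / (sigma + eps)


rng = np.random.default_rng(0)

# Class sizes from the text: HYP 2,392 -> 4,000 and NORM 8,564 -> 4,000.
hyp_idx = resample_class(np.arange(2392), 4000, rng)
norm_idx = resample_class(np.arange(8564), 4000, rng)
print(len(hyp_idx), len(norm_idx))  # 4000 4000

# Synthetic stand-in for ECG data: 100 records, 1000 time steps, 12 leads.
x_train = rng.normal(loc=0.3, scale=2.0, size=(100, 1000, 12))
x_test = rng.normal(loc=0.3, scale=2.0, size=(20, 1000, 12))

mu, sigma = fit_lead_stats(x_train)      # training statistics only
x_train_n = normalize(x_train, mu, sigma)
x_test_n = normalize(x_test, mu, sigma)  # test data reuses training statistics
```

Fitting `mu` and `sigma` on the training folds and reusing them for the test fold is what "based on training statistics only" amounts to in practice.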

Model Architecture: Simplified CNN-VAE

The proposed model combines Convolutional Neural Networks (CNN) for feature extraction with a Variational Autoencoder (VAE) structure for regularization, but simplified for production deployment.

Encoder (Feature Extraction):
- Three Conv1D layers with progressive channel expansion (64 $\to$ 128 $\to$ 256 filters).
- Kernel sizes (5, 5, 3) empirically chosen to capture P-waves, QRS complexes, and T-waves.
- Includes BatchNormalization, MaxPooling1D, and Dropout for regularization.
- GlobalAveragePooling1D reduces temporal dimension to a 256-dim vector.
Latent Space (VAE Component):
- Instead of complex stochastic sampling layers (which complicate serialization), the model uses two Dense layers to generate $z_{mean}$ and $z_{log\_var}$.
- $z_{mean}$ is used directly as the latent representation, bypassing custom Lambda layers while maintaining VAE benefits (regularization via KL divergence).
Classifier Head:
- Two Fully Connected layers (256 $\to$ 128 units) with ReLU, BatchNorm, and Dropout.
- Output layer: Dense(5, Sigmoid) for multi-label classification.
Training Configuration:
- Loss: Binary Crossentropy with class weights. Weights were inversely proportional to class frequency, with an additional 1.5x multiplier applied to HYP to boost recall.
- Optimizer: Adam (lr=0.001).
- Callbacks: EarlyStopping, ReduceLROnPlateau, ModelCheckpoint.
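
The loss ingredients described above can be sketched in NumPy. Several details here are assumptions, because the summary does not give them: the exact inverse-frequency formula (taken as $w_c = N / (K \cdot n_c)$), the way weights enter the loss (each label column is scaled by its class weight), and the counts for CD, MI, and STTC, which are hypothetical placeholders. The KL term is the standard closed form for a diagonal Gaussian against a unit Gaussian; how the authors weight it against the classification loss is not stated.

```python
import numpy as np

CLASSES = ["CD", "HYP", "MI", "NORM", "STTC"]


def class_weights(counts, boost=None):
    """Inverse-frequency weights, w_c = N / (K * n_c), with optional
    per-class multipliers such as {"HYP": 1.5}."""
    total = sum(counts.values())
    k = len(counts)
    weights = {c: total / (k * n) for c, n in counts.items()}
    for c, factor in (boost or {}).items():
        weights[c] *= factor
    return weights


def weighted_bce(y_true, y_prob, w, eps=1e-7):
    """Binary cross-entropy averaged over records and labels,
    with each label column scaled by its class weight."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    per_label = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float((per_label * w).mean())


def kl_to_standard_normal(z_mean, z_log_var):
    """KL( N(z_mean, exp(z_log_var)) || N(0, I) ), averaged over the batch."""
    kl = -0.5 * np.sum(1.0 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=1)
    return float(kl.mean())


# HYP and NORM counts come from the balanced training set in the text;
# the CD, MI and STTC counts are hypothetical placeholders.
counts = {"CD": 4500, "HYP": 4000, "MI": 4500, "NORM": 4000, "STTC": 4500}
weights = class_weights(counts, boost={"HYP": 1.5})
w = np.array([weights[c] for c in CLASSES])

y_true = np.array([[0.0, 1.0, 0.0, 0.0, 0.0]])   # one HYP-only record
y_prob = np.array([[0.1, 0.4, 0.1, 0.3, 0.1]])   # made-up sigmoid outputs
loss = weighted_bce(y_true, y_prob, w)
```

With equal HYP and NORM counts, the 1.5x multiplier is the only thing separating their weights, so a missed HYP label costs 1.5 times as much as a missed NORM label.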

3. Key Contributions

Data-Centric Validation: Demonstrated that systematic preprocessing and intelligent class balancing can yield competitive performance (87% accuracy) using a simple architecture, challenging the trend of increasing model complexity.
Simplified CNN-VAE: Proposed a production-ready architecture with only 197,093 parameters (769.89 KB) that avoids custom serialization layers, making it suitable for mobile and edge devices.
Empirical Analysis of Imbalance: Provided a detailed breakdown of class-specific performance, highlighting that despite balancing efforts, Hypertrophy (HYP) remains the most difficult class to detect.
Reproducible Pipeline: Delivered a complete, interpretable pipeline suitable for clinical deployment, emphasizing data quality over architectural novelty.
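
The reported parameter count and model size can be sanity-checked with a short tally. The sketch assumes details the summary does not state: a latent dimension of 32, one BatchNormalization layer after each convolution and each hidden Dense layer, only the $z_{mean}$ head counted (the $z_{log\_var}$ head is left out), and 4-byte float32 storage. Under these assumptions the tally reproduces both reported figures; treat it as a consistency check rather than a confirmation of the authors' configuration.

```python
def conv1d_params(in_ch, out_ch, kernel):
    return in_ch * kernel * out_ch + out_ch   # weights + biases


def dense_params(in_units, out_units):
    return in_units * out_units + out_units   # weights + biases


def batchnorm_params(channels):
    return 4 * channels   # gamma, beta, moving mean, moving variance


LEADS = 12        # from the text: 12-lead ECG
LATENT_DIM = 32   # assumption: not stated in the text

# Pooling, dropout and global average pooling add no parameters.
layers = [
    ("conv1 (k=5)", conv1d_params(LEADS, 64, 5)),
    ("bn1", batchnorm_params(64)),
    ("conv2 (k=5)", conv1d_params(64, 128, 5)),
    ("bn2", batchnorm_params(128)),
    ("conv3 (k=3)", conv1d_params(128, 256, 3)),
    ("bn3", batchnorm_params(256)),
    ("z_mean", dense_params(256, LATENT_DIM)),
    ("fc1", dense_params(LATENT_DIM, 256)),
    ("bn_fc1", batchnorm_params(256)),
    ("fc2", dense_params(256, 128)),
    ("bn_fc2", batchnorm_params(128)),
    ("output", dense_params(128, 5)),
]

total = sum(n for _, n in layers)
size_kb = total * 4 / 1024   # float32 = 4 bytes per parameter

print(total)               # 197093
print(round(size_kb, 2))   # 769.89
```

Roughly half of the parameters sit in the third convolution (98,560), so the encoder width, not the classifier head, dominates the model size.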

4. Results

Overall Performance

Binary Accuracy: 87.01%
Weighted F1-Score: 0.7454
AUC-ROC: 0.8958
Hamming Loss: 0.1299 (13% label prediction error).
Subset Accuracy: 58.74% (reflecting the difficulty of predicting all labels correctly simultaneously).
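
The gap between binary accuracy and subset accuracy follows from how each is counted. The toy example below uses made-up labels and predictions and a hypothetical helper name; it shows that Hamming loss is the complement of binary accuracy, while subset accuracy only credits records whose five labels are all correct.

```python
import numpy as np


def multilabel_metrics(y_true, y_pred):
    """Label-wise and record-wise agreement for 0/1 label matrices."""
    correct = (y_true == y_pred)
    return {
        "binary_accuracy": float(correct.mean()),               # per label
        "hamming_loss": float((~correct).mean()),               # 1 - binary accuracy
        "subset_accuracy": float(correct.all(axis=1).mean()),   # per record
    }


# Columns: CD, HYP, MI, NORM, STTC. Four made-up records.
y_true = np.array([[0, 0, 0, 1, 0],
                   [0, 0, 1, 0, 1],
                   [0, 1, 0, 0, 0],
                   [1, 0, 0, 0, 0]])
y_pred = np.array([[0, 0, 0, 1, 0],
                   [0, 0, 1, 0, 0],    # missed STTC
                   [0, 0, 0, 1, 0],    # missed HYP, false NORM
                   [1, 0, 0, 0, 0]])

m = multilabel_metrics(y_true, y_pred)
print(f"{m['binary_accuracy']:.2f} {m['hamming_loss']:.2f} {m['subset_accuracy']:.2f}")
# 0.85 0.15 0.50
```

Three wrong labels out of twenty still give 85% binary accuracy, yet they spoil two of the four records, which is the same pattern as the reported 87.01% versus 58.74%.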

Per-Class Performance

NORM (Normal): Excellent performance with 91% Recall and 0.849 F1-score. The model is highly effective at ruling out healthy patients.
STTC & MI: Good performance with F1-scores of 0.735 and 0.703, respectively.
HYP (Hypertrophy): The weakest link. Despite oversampling and weight adjustments, it achieved only 50.2% Recall and an F1-score of 0.537. This suggests that hypertrophy produces subtle ECG changes that are easily confused with other conditions or noise.
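
Precision is not reported above, but it is implied by recall and F1 through $F_1 = \frac{2PR}{P + R}$, which rearranges to $P = \frac{F_1 R}{2R - F_1}$. The short sketch below applies that identity to the reported values; the results are approximate because the inputs are rounded, and the helper name is illustrative.

```python
def precision_from(recall, f1):
    """Solve F1 = 2PR / (P + R) for P."""
    return f1 * recall / (2.0 * recall - f1)


hyp_precision = precision_from(recall=0.502, f1=0.537)
norm_precision = precision_from(recall=0.91, f1=0.849)

print(f"{hyp_precision:.2f}")   # 0.58
print(f"{norm_precision:.2f}")  # 0.80
```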

Comparison with State-of-the-Art

The proposed model achieved 87.0% accuracy, outperforming the baseline ResNet-50 (82.3%) while using 60% fewer parameters. It falls within the competitive range of recent, more complex approaches (82–88% accuracy) but at significantly lower computational cost.

5. Significance and Future Directions

Significance:

Clinical Deployment: The small model size (~770 KB) and fast inference time (~10 ms/sample) make it ideal for mobile ECG devices and resource-limited settings.
Paradigm Shift: The work supports the "Data-Centric AI" philosophy, suggesting that investing in data preparation can yield better returns than designing deeper networks.
Regulatory Readiness: The model's simplicity and serialization compatibility facilitate the path toward FDA/CE approval, provided prospective validation is conducted.

Limitations & Future Work:

Hypertrophy Detection: The low recall for HYP suggests a need for advanced techniques like SMOTE, Focal Loss, or domain-specific feature engineering (e.g., QRS voltage analysis).
Generalization: The model was tested only on PTB-XL; future work must validate performance on other datasets (e.g., CPSC2018, Georgia) to assess domain shift.
Interpretability: While the model is simple, future iterations should incorporate attention mechanisms, saliency maps, or SHAP/LIME to explain why a specific diagnosis was made, which is crucial for clinician trust.
Temporal Modeling: Replacing or augmenting 1D CNNs with Recurrent (LSTM/GRU) or Transformer layers could better capture long-range temporal dependencies in ECG signals.