This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are a doctor trying to build a crystal ball that predicts who will get sick. You have a huge pile of patient records: 95 people are healthy, and only 5 people got sick. This is a classic "imbalanced" problem. The 5 sick people are the "minority," and the 95 healthy people are the "majority."
In the world of machine learning, when you have so few "sick" examples, the computer often gets lazy. It learns that the easiest way to be "right" is to just guess "healthy" for everyone. To fix this, data scientists often use a technique called Resampling. They try to force the computer to pay more attention to the sick people by either:
- Copying the sick records (Oversampling).
- Throwing away some healthy records (Undersampling).
- Inventing fake sick records that look like real ones (SMOTE).
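For concreteness, here is a minimal Python sketch of the first two tricks on a toy label list with the 95/5 split described above. The numbers are illustrative, not from the paper (SMOTE needs actual feature vectors, so it is sketched separately below):

```python
import random

random.seed(0)

# Toy dataset: 95 healthy patients (0) and 5 sick patients (1).
labels = [0] * 95 + [1] * 5
majority = [y for y in labels if y == 0]
minority = [y for y in labels if y == 1]

# Oversampling: copy the sick records until the classes match.
oversampled = majority + minority * (len(majority) // len(minority))

# Undersampling: throw away healthy records until the classes match.
undersampled = random.sample(majority, len(minority)) + minority

print(sum(oversampled), len(oversampled))    # 95 190 -> now 50/50 sick
print(sum(undersampled), len(undersampled))  # 5 10   -> also 50/50, but tiny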
The big question this paper asks is: Does this "fix" actually make the crystal ball work better, or does it just break it?
The Big Discovery: The "Fake Balance" Trap
The authors of this paper took 10 different real-world medical datasets (involving over 600,000 patients) and tested these resampling tricks. They found a surprising and important truth:
Resampling didn't make the models smarter at spotting the sick people, but it did make them terrible at telling you how likely it is that someone is sick.
Here is the breakdown using simple analogies:
1. The "Ranking" vs. The "Probability" (Discrimination vs. Calibration)
Think of a medical test like a weather forecast.
- Discrimination (Ranking): This is like saying, "It is more likely to rain today than yesterday." The model is good at ranking: "Patient A is sicker than Patient B."
- Calibration (Probability): This is like saying, "There is a 70% chance of rain." This is the actual number you need to decide whether to bring an umbrella or not.
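The distinction can be made concrete in a few lines of Python: two sets of predicted risks with the exact same ranking have identical AUC (the standard discrimination metric) but very different calibration, measured here with the Brier score. The five patients and their risks are invented for illustration:

```python
# True outcomes and two sets of predicted risks for five patients.
y = [0, 0, 0, 1, 1]
p_calibrated = [0.05, 0.10, 0.20, 0.60, 0.80]  # honest probabilities
p_inflated   = [0.50, 0.55, 0.70, 0.95, 0.99]  # same ranking, inflated numbers

def auc(y, p):
    """Probability that a random sick patient outranks a random healthy one."""
    pairs = [(pi, pj) for pi, yi in zip(p, y) if yi == 1
                      for pj, yj in zip(p, y) if yj == 0]
    return sum(pi > pj for pi, pj in pairs) / len(pairs)

def brier(y, p):
    """Mean squared error of the probabilities (lower = better calibrated)."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)

print(auc(y, p_calibrated), auc(y, p_inflated))      # 1.0 1.0 -- same ranking
print(brier(y, p_calibrated), brier(y, p_inflated))  # calibrated wins
```

Both models "bring the umbrella" for the same patients, but only the first one tells you the true chance of rain.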
The Study's Finding:
The resampling tricks were like putting on a pair of glasses that made the sick people stand out a little more, but they also distorted the colors of the sky.
- The Good News: The models could still tell who was sicker than whom (the ranking stayed the same).
- The Bad News: The actual numbers (the probabilities) became completely wrong. If the model said "50% chance of heart attack," it might actually be a 5% chance. Or if it said "10%," it might be 50%.
2. The "Cooking" Analogy
Imagine you are a chef trying to learn how to make a perfect soup.
- The Real Data: You have 100 bowls of clear broth and only 5 bowls of spicy soup. You taste them all. You learn that "spicy" is rare.
- The Resampling Trick: To help you learn, someone gives you 95 extra bowls of spicy soup (by copying the 5 real ones). Now you have 100 spicy and 100 clear.
- The Result: You become an expert at tasting "spicy" vs. "clear." You can rank them perfectly. BUT, when you go back to the real kitchen where spicy soup is actually rare, you are now convinced that every bowl is spicy. You overestimate the risk. You tell the customer, "This soup is 90% spicy!" when it's actually just 5%.
The study found that by artificially balancing the data, the models learned the wrong "base rate" of the disease. They forgot how rare the disease actually is in the real world.
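There is a standard "prior shift" correction that quantifies exactly this distortion: rescale the model's odds by the ratio of the real-world disease odds to the odds in the resampled training set. The sketch below (function name is illustrative, not from the paper, whose actual recommendation is to avoid resampling in the first place) shows why a "50%" from a model trained on balanced data can really mean 5%:

```python
def correct_for_resampling(p, train_prev, true_prev):
    """Map a probability learned on resampled data back to the real-world
    base rate by rescaling the odds (a standard prior-shift correction)."""
    odds = (p / (1 - p)) * ((true_prev / (1 - true_prev)) /
                            (train_prev / (1 - train_prev)))
    return odds / (1 + odds)

# Model trained on a 50/50 resampled set says "50% risk",
# but the disease really affects only 5% of patients.
print(round(correct_for_resampling(0.50, 0.50, 0.05), 3))  # 0.05
```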
3. The "Fake Fruit" Problem (SMOTE)
One of the methods, SMOTE, is like a chef trying to invent a new fruit by blending a strawberry and a banana.
- In a computer, this creates "synthetic" patients that are mathematical averages of real patients.
- The study found that these fake patients often look weird to the model. They are like a "fruit smoothie" that doesn't exist in nature. When the model tries to learn from these fake examples, it gets confused about what a real sick person looks like, leading to even worse predictions.
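The core SMOTE mechanism is linear interpolation: pick a real minority example, pick one of its neighbours, and place a synthetic point somewhere on the line between them. A simplified sketch (real SMOTE selects among the k nearest neighbours; the feature values here are invented):

```python
import random

random.seed(0)

# Five real "sick" patients described by two features (e.g. age, systolic BP).
minority = [(55, 140), (60, 150), (62, 135), (70, 160), (58, 145)]

def smote_like_sample(points):
    """Generate one synthetic patient by interpolating between a real
    minority patient and another (real SMOTE uses k nearest neighbours)."""
    a = random.choice(points)
    b = random.choice([p for p in points if p != a])
    u = random.random()  # position along the segment from a to b
    return tuple(ai + u * (bi - ai) for ai, bi in zip(a, b))

synthetic = [smote_like_sample(minority) for _ in range(3)]
print(synthetic)  # "blended" patients who may not resemble anyone real
```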
So, What Should Doctors Do?
The paper concludes with a very clear message for anyone building medical AI:
Stop trying to "fix" the data imbalance by copying or deleting data.
Instead, do this:
- Train on the real, messy data. Let the model see the true rarity of the disease.
- Trust the ranking. If the model says Patient A is higher risk than Patient B, that's usually reliable.
- Adjust the "Trigger," not the "Training." If you need to catch more sick people (increase sensitivity), don't retrain the model. Just lower the alarm threshold. Think of it like turning up the volume on a smoke detector. You don't need to rebuild the detector; you just need to make it scream at a lower level of smoke.
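Threshold tuning can be shown in a few lines: the model and its risk scores stay exactly the same, and only the alarm level changes. The risks and outcomes below are invented for illustration:

```python
# Predicted risks from a model trained on the real, imbalanced data.
risks    = [0.02, 0.04, 0.08, 0.15, 0.30, 0.45, 0.60, 0.85]
outcomes = [0,    0,    0,    1,    0,    1,    1,    1   ]

def sensitivity(threshold):
    """Fraction of truly sick patients flagged at this alarm threshold."""
    flagged = [r >= threshold for r in risks]
    sick = [f for f, y in zip(flagged, outcomes) if y == 1]
    return sum(sick) / len(sick)

# Same model, different trigger: lowering the threshold catches more cases.
print(sensitivity(0.50))  # 0.5 -- 2 of 4 sick patients flagged
print(sensitivity(0.10))  # 1.0 -- all 4 flagged
```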
The Bottom Line
In the world of medical prediction, reporting honest probabilities matters just as much as ranking the right people.
By trying to force the data to look balanced, researchers were accidentally breaking the model's ability to tell the truth about how likely an event is. The best approach is to let the model learn from the real, unbalanced world, and then simply adjust the decision rules later if needed. Don't fake the data; just tune the alarm.