This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are the admiral of a fleet of 51 ships (51 different hospitals), and a storm (the COVID-19 pandemic) has just hit. You have a huge list of passengers (263,619 patients) aboard, and your job is to figure out two things:
- Who is likely to survive the storm? (Mortality prediction)
- How long will they stay on the ship before they can go home? (Length of Stay prediction)
To do this, you hire a team of four different "crystal ball" experts (Machine Learning models) to look at the passengers' medical charts and make predictions. This paper is the report card on how well those crystal balls worked.
Here is the story of what they found, explained simply:
1. Four Crystal Balls, Not One
The researchers didn't just use one crystal ball; they used four different types of "AI" to see which was best:
- The Old School Statistician: A simple, reliable math formula.
- The Random Forest: A group of trees that vote on the answer.
- The XGBoost: A super-smart, fast learner that gets better with every mistake.
- The Neural Network: A digital brain that tries to mimic how human neurons think.
They fed these AI models data like age, weight, existing health problems (like diabetes or heart disease), and whether the patient had gotten vaccinated.
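The four-way comparison above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: it assumes scikit-learn, uses `LogisticRegression` as a stand-in for the unnamed "old school statistician," `GradientBoostingClassifier` in place of XGBoost, and synthetic data instead of the real 263,619-patient cohort.

```python
# Toy sketch: fit four classifier families and compare them by AUROC.
# All names and data here are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for features like age, weight, comorbidities, vaccination.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "neural_net": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                random_state=0),
}
aurocs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aurocs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(aurocs)
```

Running the same metric over several model families like this is standard benchmarking practice; the paper's actual feature set and tuning would differ.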
2. The "Who Survives?" Game (Mortality Prediction)
The Result: The AI models were okay, but not amazing.
- The Score: They reached an AUROC of about 0.72 out of 1.0. Be careful with this scale: 0.5 means pure guessing and 1.0 means perfect, so 0.72 is meaningfully better than a coin flip, but far from reliable.
- The Catch (The "Class Imbalance" Problem): Here is the tricky part. In a hospital, most people survive, and only a few pass away. It's like trying to find a needle in a haystack.
- Scenario A (No help): The AI looked at the haystack and said, "I'll just guess everyone survives." It got a high score because it was right most of the time, but it missed every single person who was actually going to die. It was useless for saving lives.
- Scenario B (The "SMOTE" Trick): The researchers used a trick called SMOTE. Imagine the AI is a chef, and there are very few "death" ingredients in the kitchen. SMOTE is like the chef making fake copies of those rare ingredients so the chef can practice cooking with them.
- The Trade-off: When they used SMOTE, the AI got much better at spotting the people who might die (it stopped missing the needles). However, it started raising false alarms, flagging people who would actually survive, and its overall score dropped. It was like the chef becoming great at the rare dish but starting to mess up the regular meals.
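The core of the SMOTE "fake ingredients" trick is simple: make new rare-class rows by interpolating between a real rare-class row and one of its nearest rare-class neighbours. Below is a minimal numpy sketch of that idea, not the real library (imbalanced-learn's `SMOTE` does the full k-nearest-neighbour bookkeeping); the function name and data are invented for illustration.

```python
# Minimal SMOTE-style oversampling sketch (illustrative, not the paper's code):
# each synthetic row lies on the line segment between a real minority row
# and one of its k nearest minority neighbours.
import numpy as np

def smote_like(X_minority, n_new, k=3, seed=0):
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # distances from row i to every minority row
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip row i itself
        j = rng.choice(neighbours)
        lam = rng.random()                    # interpolation weight in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(10, 4))  # 10 rare-class rows
X_new = smote_like(X_min, n_new=20)
print(X_new.shape)
```

Because every synthetic row is a blend of two real rows, the fakes stay inside the neighbourhood of the real rare cases rather than being random noise.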
The Lesson: You can't just look at the "overall grade" (AUROC). You have to look at whether the AI actually catches the people who need help.
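The lesson is easy to show with numbers. In this toy example (invented rates, not the paper's data), a lazy model that predicts "everyone survives" gets roughly 95% accuracy while catching exactly zero of the deaths:

```python
# Toy demonstration of why a single overall score misleads under class
# imbalance: high accuracy, zero recall for the rare class.
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% deaths (the "needles")
y_pred = np.zeros_like(y_true)                  # lazy model: everyone survives

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].sum() / max(y_true.sum(), 1)  # deaths caught
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

A clinician looking only at the accuracy would call this model excellent; a clinician looking at recall would call it useless, which is exactly the paper's point.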
3. The "How Long Will They Stay?" Game (Length of Stay)
The Result: The AI models were terrible at this.
- The Score: They got a score of roughly 0.06 out of 1.0 (an R², meaning the models explained only about 6% of the variation in how long patients stayed). This is like trying to predict next year's weather using only yesterday's thermometer reading.
- Why? The AI looked at the patient's health, but it couldn't see the hospital.
- One hospital might discharge patients quickly because they have a great team of social workers.
- Another hospital might keep patients longer because they have fewer beds or different rules.
- The AI didn't have a way to "see" these invisible hospital rules. It was like trying to guess how long a car trip will take by only looking at the driver, without knowing if the traffic lights are broken or if there's a roadblock ahead.
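The "invisible hospital rules" problem can be simulated in a few lines. This toy model (all numbers assumed, not taken from the paper) gives each of 51 hospitals its own hidden effect on stay length; a predictor that sees only the patient ends up with a small R², much like the paper's models:

```python
# Toy simulation: when unobserved hospital effects drive most of the
# variation, a patient-features-only model explains very little (low R^2).
import numpy as np

rng = np.random.default_rng(0)
n, n_hospitals = 5000, 51
severity = rng.normal(size=n)                    # observed patient feature
hospital = rng.integers(n_hospitals, size=n)     # which ship the patient is on
hospital_effect = rng.normal(scale=3.0, size=n_hospitals)  # beds, policies, staffing
los = 5 + 1.0 * severity + hospital_effect[hospital] + rng.normal(size=n)

# Best patient-only linear predictor: regress stay length on severity alone.
beta = np.polyfit(severity, los, 1)
pred = np.polyval(beta, severity)
r2 = 1 - ((los - pred) ** 2).sum() / ((los - los.mean()) ** 2).sum()
print(f"R^2 with patient features only: {r2:.2f}")
```

Adding the hospital identifier as a feature (or a per-hospital random effect) would recover most of the missing variance, which is one reason mixed-effects models are popular for multi-site data.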
4. The "Remdesivir" Mystery
The study also looked at who got a specific medicine called Remdesivir.
- The Observation: People who got Remdesivir were actually sicker to begin with. They were older, had more health problems, and had higher death rates.
- The Analogy: Imagine you see a group of people wearing heavy raincoats and umbrellas. You might think, "Wow, those raincoats are dangerous; people wearing them get wet!" But actually, the raincoats didn't cause the wetness; the people wore them because it was already raining hard.
- The Takeaway: The medicine wasn't killing people; the sickest people were just the ones getting the medicine. This is called "confounding by indication." It means you can't just compare the two groups to see if the drug works; you have to account for the fact that the sick people were chosen first.
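Confounding by indication is easy to reproduce in a toy simulation (all numbers invented): give a drug with zero true effect preferentially to sicker patients, and the naive treated-vs-untreated comparison still shows higher mortality among the treated. Comparing patients of similar severity makes the spurious gap shrink toward zero:

```python
# Toy simulation of confounding by indication: the drug does nothing,
# but sicker patients are more likely to receive it, so the naive
# comparison makes the drug look harmful.
import numpy as np

rng = np.random.default_rng(0)
n = 20000
severity = rng.normal(size=n)
treated = rng.random(n) < 1 / (1 + np.exp(-2 * severity))  # sicker -> more likely treated
p_death = 1 / (1 + np.exp(-(severity - 2)))                # true drug effect: none
died = rng.random(n) < p_death

naive_gap = died[treated].mean() - died[~treated].mean()

# Crude adjustment: compare only patients of similar severity.
band = np.abs(severity) < 0.1
adj_gap = died[band & treated].mean() - died[band & ~treated].mean()
print(f"naive gap={naive_gap:.3f}, severity-matched gap={adj_gap:.3f}")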
5. The "Senior Citizen" Problem
When the researchers tested the AI only on people over 65, the models got even worse.
- Why? When you look at a group of 20-year-olds, they are all very different. But when you look at a group of 80-year-olds, they often share similar health struggles (arthritis, heart issues, etc.).
- The Analogy: It's like trying to sort a deck of cards where every card is a slightly different shade of red. It's very hard to tell them apart. The AI needed more clues (like how frail a person is or how their blood work changes day-to-day) to make a good guess for older adults.
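The "every card is a slightly different shade of red" effect can also be shown numerically. In this toy setup (assumed numbers), the same risk score separates outcomes well in a diverse cohort but poorly in a homogeneous one, because there is less between-patient variation for the score to exploit:

```python
# Toy illustration of the subgroup problem: shrinking the spread of the
# risk score (a more homogeneous cohort) lowers AUROC for the same model.
import numpy as np

def auroc(score, y):
    """Probability a random positive case outranks a random negative case."""
    pos, neg = score[y == 1], score[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

rng = np.random.default_rng(0)

def simulate(spread, n=4000):
    risk = rng.normal(scale=spread, size=n)      # between-patient variation
    y = (rng.random(n) < 1 / (1 + np.exp(-risk))).astype(int)
    return auroc(risk, y)

broad = simulate(spread=2.0)   # mixed-age cohort: patients differ a lot
narrow = simulate(spread=0.5)  # 65+ cohort: patients look alike
print(f"broad={broad:.2f}, narrow={narrow:.2f}")
```

This is why the paper suggests richer inputs for older adults, such as frailty or day-to-day lab trends: new features can restore the variation the models need.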
The Big Picture Conclusion
This paper tells us three main things in plain English:
- AI is a helpful assistant, not a fortune teller. It can give a rough idea of who is at risk, but it's not perfect yet.
- Context matters. You can't just look at the patient; you have to understand the hospital they are in to predict how long they will stay.
- Don't trust the "Average Score." In medicine, it's better to have a model that catches the sick people (even if it makes a few mistakes) than a model that just says "everyone is fine" because that's statistically easier.
The researchers are essentially saying: "We built some cool tools, and they work okay, but to really save lives and manage hospitals, we need to feed them more data and teach them to understand the 'human' side of the hospital, not just the numbers."