Evaluation of SOFA-2 Score Performance Across… — Plain-Language Explanation

Original authors: Ellen, J. G., Hao, S., Gao, C. A., Arias, M. D. P., Viola, M., Wong, A.-K. I., Mattie, H., Parker, W., Haidau, C., Matos, J., Chaves, R. C. d. F., Celi, L. A.

Published 2026-03-11


View on medRxiv ↗PDF ↗

CC BY 4.0


Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a doctor in a busy emergency room. A patient arrives, looking very sick. You need to decide: How likely is this person to survive their stay in the Intensive Care Unit (ICU)?

To help make this decision, doctors use a "report card" called the SOFA score. Think of this score like a weather forecast for a patient's organs. It checks six different systems (like the heart, lungs, kidneys, and brain), grades each one, and adds the grades up to a total from 0 to 24.

Low score: The weather is clear; the organs are working fine.
High score: A massive storm is brewing; the organs are failing.

For a long time, doctors have trusted this "weather forecast" to tell them who is in the most danger. Recently, a new, updated version called SOFA-2 was released. It was tested on over 3 million ICU admissions in 9 countries and worked well overall.

But here is the catch: The original test of 3 million people didn't ask, "Does this weather forecast work equally well for everyone?"

That is what this new study by Jacob Ellen and his team wanted to find out. They took the new SOFA-2 score and tested it on a different group of patients (about 64,000 people in Boston) to see if it was fair to everyone, regardless of their age, race, language, or insurance.

The Big Discovery: The "One-Size-Fits-All" Problem

The researchers found that while the SOFA-2 score is a good "weather forecast" overall, it has some serious blind spots. It's like a GPS app that works great for driving on a highway but gets completely confused when you try to drive through a narrow, winding mountain road.

Here are the specific "glitches" they found:

1. The Age Gap: The "Old Car" Analogy

The most shocking finding was about age.

The Young: For patients aged 18–44, the score was an accurate GPS. It was good at telling who was in trouble.
The Old: For patients aged 75 and older, the score became noticeably less reliable. It started underestimating the danger.
The Metaphor: Imagine an old car and a new car both having a flat tire. The new car's warning light flashes brightly (high score = high danger). But the old car's warning light is dim or broken (low score = low danger), even though the old car is actually in more trouble because its engine is already worn out.
The Result: The score told doctors that older patients were safer than they actually were. In reality, older patients with the same score had a higher chance of dying than younger patients: at a score of 10, 24.5% of patients aged 75 and older died, compared with 19.3% of patients aged 18–44.

2. The Language Barrier: The "Lost in Translation" Effect

The score worked less well for patients who spoke languages other than English.

The Metaphor: Imagine a translator trying to explain a complex medical problem to a doctor who doesn't speak the patient's language. Some details get lost.
The Result: The score was slightly less accurate for non-English speakers. This suggests that the way doctors write down notes or the way patients are treated might differ based on language, and the score didn't catch those subtle differences.

3. The "Missing File" Mystery

The study found something very worrying about patients whose race or language was listed as "Unknown" in the hospital records.

The Metaphor: Imagine a library where some books have no title on the spine. The librarians assume these are just "regular" books. But when they check, they find these "unknown" books are actually the most damaged and dangerous ones in the whole library.
The Result: Patients whose race was listed as "Unknown" had nearly double the death rate (14.1% vs. 7.2%), and the score underestimated their risk. This suggests that when a hospital hasn't recorded who a patient is, that patient is often in a much more precarious situation.

4. Race and Sex: Mostly Fair, But Not Perfect

Race: For patients whose race was clearly recorded, the score worked fairly well for everyone. However, the study noted that a large chunk of patients (14%) had "Unknown" race, which skewed the data.
Sex: The score was slightly off for women. It tended to think women were doing slightly better than they actually were, while it thought men were doing slightly worse than they actually were.

Why Does This Matter?

Think of the SOFA-2 score as a tool in a doctor's toolbox. If you use a hammer to fix a watch, you might break it. Similarly, if a doctor uses a score that is biased against older people, they might make the wrong decision.

The Risk: If the score tells a doctor, "This 80-year-old is low risk," the doctor might decide not to use a life-saving machine or might stop treatment too early. But because the score was "blind" to the age factor, that patient might have actually needed more help, not less.
The Lesson: You cannot just trust a tool because it works "on average." You have to test it on every type of person who will use it.

The Bottom Line

The authors of this paper aren't saying "Throw away the SOFA-2 score." They are saying, "Use it with caution."

They are calling for a new rule in medicine: Before we let a computer score decide who gets life-saving care, we must check if that score is fair to the elderly, non-English speakers, and those with missing records. If we don't, we risk leaving the most vulnerable patients behind, thinking they are safer than they really are.

In short: A good tool must work for everyone, not just the majority. This study sounded the alarm that the current tool has a few cracks, especially for our oldest patients.

1. Problem Statement

The Sequential Organ Failure Assessment (SOFA) score is a cornerstone of critical care for quantifying organ dysfunction and predicting mortality. A recently updated version, SOFA-2, was validated on over 3 million ICU admissions across 9 countries, demonstrating robust overall predictive validity. However, the original validation study did not systematically evaluate performance across demographic subgroups (age, sex, race/ethnicity, language, insurance).

Given that clinical prediction tools increasingly inform high-stakes decisions (e.g., triage, resource allocation), there is a critical need to determine if SOFA-2 performs equitably across diverse populations or if it inadvertently perpetuates healthcare disparities. Previous iterations of the SOFA score have shown biases regarding race and sex, prompting this study to assess whether the updated SOFA-2 score mitigates or exacerbates these issues.

2. Methodology

Study Design and Data Source

Type: Retrospective cohort study serving as an external validation.
Dataset: MIMIC-IV (Version 3.1), containing deidentified data from Beth Israel Deaconess Medical Center (Boston, MA) from 2008–2022.
Cohort: 64,015 adult ICU admissions (first admission per patient).
Exclusions: ICU stays <6 hours, physiologic values outside clinically plausible ranges.
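
The cohort rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the field names (`patient_id`, `intime`, `outtime`) are hypothetical, the order of the two steps (first admission, then the 6-hour rule) is an assumption, and the check for implausible physiologic values is left out.

```python
from datetime import datetime, timedelta

MIN_STAY = timedelta(hours=6)  # exclusion threshold stated in the study

def build_cohort(stays):
    """Keep each patient's first ICU stay, then drop stays shorter than 6 hours.

    `stays` is a list of dicts with hypothetical keys:
    patient_id, intime, outtime (datetime objects).
    """
    first_by_patient = {}
    for stay in sorted(stays, key=lambda s: s["intime"]):
        # setdefault keeps only the earliest stay seen for each patient
        first_by_patient.setdefault(stay["patient_id"], stay)
    return [
        s for s in first_by_patient.values()
        if s["outtime"] - s["intime"] >= MIN_STAY
    ]

# Made-up example: patient 1 has two stays, patient 2 has one 3-hour stay
stays = [
    {"patient_id": 1, "intime": datetime(2020, 1, 1, 0), "outtime": datetime(2020, 1, 2, 0)},
    {"patient_id": 1, "intime": datetime(2020, 3, 1, 0), "outtime": datetime(2020, 3, 5, 0)},
    {"patient_id": 2, "intime": datetime(2020, 1, 1, 0), "outtime": datetime(2020, 1, 1, 3)},
]
cohort = build_cohort(stays)
```

In this made-up example only patient 1's first stay survives: the second stay is a repeat admission and patient 2's stay is under 6 hours.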

Variable Definitions

Predictor: First-day SOFA-2 score (0–24), calculated using the worst recorded values in the first 24 hours across six organ systems (Neurological, Cardiovascular, Respiratory, Hepatic, Renal, Coagulation).
- Key Modifications in SOFA-2: Revised $PaO_2/FiO_2$ thresholds (using $SpO_2/FiO_2$ as fallback), inclusion of advanced ventilatory support (NIV, HFNC, ECMO) for maximum respiratory scores, specific vasopressor criteria, and pharmacologic treatment of delirium for neurological scoring.
Outcome: ICU mortality (death during admission or within 6 hours of discharge).
Demographic Subgroups:
- Age: 18–44, 45–64, 65–74, ≥75 years.
- Sex: Male, Female.
- Race/Ethnicity: White, Black, Hispanic, Asian, Other, Unknown.
- Language: English, Non-English, Unknown.
- Insurance: Private, Medicare, Medicaid, Other.
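
As a rough sketch of how these definitions fit together, the snippet below sums six per-organ subscores into a total and assigns the age bands listed above. The 0–4 range per organ system is an assumption inferred from the 0–24 total over six systems; the function names are hypothetical, and the actual SOFA-2 thresholds for each organ are not reproduced here.

```python
ORGAN_SYSTEMS = (
    "neurological", "cardiovascular", "respiratory",
    "hepatic", "renal", "coagulation",
)

def sofa2_total(components):
    """Sum six per-organ subscores (assumed 0-4 each) into a 0-24 total."""
    missing = set(ORGAN_SYSTEMS) - set(components)
    if missing:
        raise ValueError(f"missing subscores: {sorted(missing)}")
    for name in ORGAN_SYSTEMS:
        if not 0 <= components[name] <= 4:
            raise ValueError(f"{name} subscore out of range: {components[name]}")
    return sum(components[name] for name in ORGAN_SYSTEMS)

def age_group(age):
    """Map an adult's age in years to the study's four age bands."""
    if age < 18:
        raise ValueError("the cohort includes adults only")
    if age <= 44:
        return "18-44"
    if age <= 64:
        return "45-64"
    if age <= 74:
        return "65-74"
    return ">=75"

# Made-up patient: worst first-day subscore for each organ system
example = {
    "neurological": 1, "cardiovascular": 3, "respiratory": 2,
    "hepatic": 0, "renal": 2, "coagulation": 1,
}
total = sofa2_total(example)
band = age_group(78)
```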

Statistical Analysis

Discrimination: Assessed via Area Under the Receiver Operating Characteristic Curve (AUROC). Differences between subgroups were tested using nonparametric bootstrap resampling (1,000 iterations). A $|\Delta AUROC| > 0.05$ was considered clinically meaningful.
Calibration: Assessed using calibration intercepts (ideal = 0) and slopes (ideal = 1) derived from logistic regression models fitted to the overall cohort and applied to subgroups.
Software: R version 4.5.0.
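
The discrimination analysis can be illustrated with a small, dependency-free Python sketch (the study itself used R). It computes AUROC as the probability that a randomly chosen death has a higher score than a randomly chosen survivor, then bootstraps the difference between two subgroups. Resampling each subgroup independently and using a percentile interval are assumptions; the summary above does not say which bootstrap variant the authors used, and the data below are made up.

```python
import random

def auroc(scores, labels):
    """Chance that a random death outscores a random survivor (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_delta_auroc(group_a, group_b, n_boot=1000, seed=0):
    """Point estimate and percentile 95% CI for AUROC(group_b) - AUROC(group_a).

    Each group is a list of (score, died) pairs.
    """
    # Point estimate; raises early if a group has only deaths or only survivors
    point = auroc(*zip(*group_b)) - auroc(*zip(*group_a))
    rng = random.Random(seed)
    deltas = []
    while len(deltas) < n_boot:
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        try:
            delta = auroc(*zip(*resample_b)) - auroc(*zip(*resample_a))
        except ValueError:
            continue  # resample lacked deaths or survivors; draw again
        deltas.append(delta)
    deltas.sort()
    lower = deltas[int(0.025 * n_boot)]
    upper = deltas[int(0.975 * n_boot) - 1]
    return point, lower, upper

# Made-up subgroups: scores separate outcomes cleanly in one, poorly in the other
young = [(s, int(s >= 8)) for s in range(12)]
old = [(s, int(s % 3 == 0)) for s in range(12)]
point, lower, upper = bootstrap_delta_auroc(young, old, n_boot=200)
```

A negative point estimate means the second group has lower discrimination than the first, which is the direction of the age gap reported in this study.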

3. Key Contributions

First Subgroup Analysis of SOFA-2: This is the first study to evaluate the fairness of the newly published SOFA-2 score across diverse demographic strata.
Identification of Age-Related Bias: The study quantifies a significant decline in predictive accuracy for older adults, a finding with direct implications for triage and prognostication in an aging ICU population.
Analysis of "Unknown" Demographics: The paper highlights that patients with missing demographic data (specifically race/ethnicity and language) represent a high-risk group with distinct mortality profiles and poor model calibration, suggesting that missing data itself is a marker of vulnerability.
External Validation Context: By applying SOFA-2, a score validated on multinational data, to MIMIC-IV (a single-center US dataset), the study tests the generalizability of SOFA-2 in a different healthcare setting.

4. Key Results

Overall Performance

AUROC: 0.77 (95% CI: 0.76–0.77), indicating acceptable discrimination.
Calibration: Intercept 0.00, Slope 1.00 (perfect for the overall cohort as the model was fit to this data).
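
To see why the overall intercept and slope come out as exactly 0 and 1, here is a minimal sketch under one reading of the method: fit a logistic regression of death on the score in the whole cohort, then regress the outcome on that model's predicted log-odds within each group. The Newton-Raphson fitter, the function names, and the synthetic data are all illustrative assumptions, and whether the authors estimated the intercept jointly with the slope or with the slope fixed at 1 is not stated above.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y=1) = a + b*x by Newton-Raphson; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # gradient w.r.t. intercept
            g1 += (yi - p) * xi       # gradient w.r.t. slope
            h00 += w                  # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def calibration(model, scores, died):
    """Calibration (intercept, slope) of `model` = (a, b) within one group."""
    a, b = model
    predicted_logit = [a + b * s for s in scores]
    return fit_logistic(predicted_logit, died)

def make_group(extra_deaths):
    """Synthetic group: 10 patients per score 0-11, deaths rising with score."""
    scores, died = [], []
    for s in range(12):
        deaths = min(10, s // 2 + extra_deaths)
        for i in range(10):
            scores.append(s)
            died.append(1 if i < deaths else 0)
    return scores, died

young_x, young_y = make_group(0)
old_x, old_y = make_group(2)   # same scores, more deaths at every score
all_x, all_y = young_x + old_x, young_y + old_y

model = fit_logistic(all_x, all_y)
overall_cal = calibration(model, all_x, all_y)
old_cal = calibration(model, old_x, old_y)
```

Re-fitting on the same data that produced the model returns an intercept of 0 and a slope of 1 by construction, so the overall figures say nothing about fairness. In the synthetic "old" group, which dies more often at every score, the intercept comes out positive, the same pattern the study describes as underprediction.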

Performance by Subgroup

Age (Significant Finding):
- Discrimination declined sharply with age.
- 18–44 years: AUROC 0.85 (Good).
- ≥75 years: AUROC 0.72 (Acceptable but significantly lower).
- Gap: $\Delta AUROC = -0.14$ (95% CI: -0.16 to -0.11). The rounded AUROCs above differ by 0.13, so the reported gap presumably reflects unrounded values.
- Calibration: Systematic underprediction of mortality in older patients (Intercept = 0.39) and overprediction in younger patients. At a SOFA-2 score of 10, mortality was 19.3% for ages 18–44 vs. 24.5% for ages ≥75.
Language:
- Non-English speakers had significantly lower discrimination (AUROC 0.73) compared to English speakers (AUROC 0.77; $\Delta = -0.04$).
- Patients with Unknown language status had the highest mortality (23.1%) and poorest calibration (Intercept = 1.14).
Insurance:
- Medicare recipients (often older) had lower discrimination (AUROC 0.73) compared to Private (0.81) and Medicaid (0.82) patients.
- Calibration showed underprediction for Medicare patients (Intercept = 0.16).
Race/Ethnicity:
- Among documented groups, discrimination was consistent (AUROC 0.74–0.79) with no statistically significant differences.
- Unknown Race/Ethnicity (14.3% of cohort): This group had nearly double the mortality rate (14.1% vs. 7.2%) and poor calibration (Intercept = 0.65), indicating systematic underprediction.
Sex:
- Discrimination was identical (AUROC 0.77 for both).
- Calibration showed slight overprediction for males and underprediction for females, though the gap was smaller than age-related disparities.
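
One way to get a feel for the calibration intercepts reported above is to convert them from the log-odds scale into odds multipliers. The sketch below does that arithmetic. It assumes each subgroup's calibration slope is close to 1, so that the intercept acts as a pure shift in log-odds; subgroup slopes are not reported in this summary, so treat the numbers as a rough reading aid rather than a result of the study.

```python
import math

# Calibration intercepts reported above, on the log-odds scale
intercepts = {
    "Age >=75": 0.39,
    "Medicare": 0.16,
    "Unknown race/ethnicity": 0.65,
    "Unknown language": 1.14,
}

def odds_multiplier(intercept):
    """Factor by which observed odds of death exceed predicted odds."""
    return math.exp(intercept)

def shifted_risk(predicted_risk, intercept):
    """Apply a log-odds shift to a predicted probability (slope assumed to be 1)."""
    logit = math.log(predicted_risk / (1 - predicted_risk)) + intercept
    return 1 / (1 + math.exp(-logit))

multipliers = {group: odds_multiplier(v) for group, v in intercepts.items()}
risk_age_75 = shifted_risk(0.10, intercepts["Age >=75"])
```

Under that assumption, an intercept of 0.39 corresponds to observed odds of death about 1.48 times the predicted odds, so a predicted 10% risk would correspond to roughly 14% observed mortality.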

5. Significance and Implications

Clinical Caution Required: The SOFA-2 score should be interpreted with caution, particularly for older adults (≥75 years) and non-English speakers: it separates survivors from non-survivors less well in both groups, and it systematically underestimates mortality in older adults. Relying solely on this score for triage or goals-of-care discussions could disadvantage these populations.
Equity in Validation: The study underscores that high overall AUROC does not guarantee equitable performance. Routine equity evaluation across demographic subgroups is essential before the widespread implementation of clinical prediction tools.
Data Quality as a Risk Signal: The poor performance and high mortality in groups with "Unknown" demographic data suggest that missing information may correlate with unmeasured social determinants of health or acute severity that precludes documentation. Excluding these patients from analysis may bias results toward lower-risk populations.
Future Directions: The authors suggest that future iterations of severity scores may need to incorporate frailty, comorbidities, or social determinants to improve accuracy in older and marginalized populations, as acute physiological parameters alone may not capture the full risk profile in these subgroups.

Conclusion: While SOFA-2 maintains acceptable overall discrimination (AUROC 0.77), it exhibits clinically meaningful disparities, most notably a substantial decline in discrimination for older patients. This highlights the necessity of subgroup-specific validation to ensure fairness in critical care decision-making.

Evaluation of SOFA-2 Score Performance Across Demographic Subgroups: An External Validation Study Using MIMIC-IV