Covariate adjustment for hierarchical outcomes and the… — Plain-Language Explanation

Original authors: Hazewinkel, A.-D., Gregson, J., Bartlett, J. W., Gasparyan, S. B., Wright, D., Pocock, S.

Published 2026-03-31

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a judge at a talent show. Your job is to decide which team is better: Team A (the new treatment) or Team B (the old standard).

In the past, judges often looked at just one thing: "Who got eliminated first?" If a contestant left the show early, they lost. But this is a bit unfair. Imagine Team A has a contestant who gets a minor cough (a small problem) and leaves early, while Team B has a contestant who gets a life-threatening illness (a huge problem) but stays until the end. If you only look at "who left first," Team A looks worse, even though Team B actually had the much more serious issue.

The "Win Ratio" Solution
To fix this, statisticians invented the Win Ratio. Instead of just looking at who left first, they compare every single person from Team A against every single person from Team B, like a giant round-robin tournament.

They use a "priority list" (a hierarchy):

Death (The worst outcome).
Hospitalization (Bad, but not as bad as death).
Quality of Life Score (How you feel day-to-day).

When comparing two people, the judge looks at the top of the list first.

If Person A is alive and Person B died, Person A wins.
If both are alive, the judge looks at the next item: hospitalization. The one who stayed out of the hospital, or stayed out of it for longer, wins.
If both are alive and never hospitalized, the judge looks at the quality of life score. The one with the better score wins.

The final score is simply: Total Wins for Team A divided by Total Wins for Team B.
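The round-robin scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Patient` fields, the helper names, and the six made-up patients are all assumptions, every patient is assumed to be followed for the same length of time (no censoring), and when both people in a matchup died the sketch lets the one who survived longer win, a detail the description above leaves out.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Patient:
    death_day: Optional[float]  # None means the patient did not die
    hosp_day: Optional[float]   # None means the patient was never hospitalized
    qol: float                  # quality-of-life score, higher is better


def _later_is_better(day_a: Optional[float], day_b: Optional[float]) -> int:
    """Compare one event tier: a later (or absent) event wins."""
    t_a = math.inf if day_a is None else day_a
    t_b = math.inf if day_b is None else day_b
    return (t_a > t_b) - (t_a < t_b)  # 1 = a wins, -1 = b wins, 0 = tie


def compare(a: Patient, b: Patient) -> int:
    """Walk down the priority list until one tier breaks the tie."""
    for result in (
        _later_is_better(a.death_day, b.death_day),  # tier 1: death
        _later_is_better(a.hosp_day, b.hosp_day),    # tier 2: hospitalization
    ):
        if result != 0:
            return result
    # tier 3: quality-of-life score
    return (a.qol > b.qol) - (a.qol < b.qol)


def win_ratio(team_a: List[Patient], team_b: List[Patient]) -> Tuple[float, int, int, int]:
    """Compare every Team A patient with every Team B patient."""
    wins = losses = ties = 0
    for a in team_a:
        for b in team_b:
            result = compare(a, b)
            if result > 0:
                wins += 1
            elif result < 0:
                losses += 1
            else:
                ties += 1
    ratio = wins / losses if losses else math.inf
    return ratio, wins, losses, ties


# Made-up patients, purely for illustration.
team_a = [Patient(None, None, 70), Patient(None, 120, 55), Patient(300, 90, 40)]
team_b = [Patient(None, None, 60), Patient(200, 50, 35), Patient(None, 80, 65)]

ratio, wins, losses, ties = win_ratio(team_a, team_b)
print(f"wins={wins} losses={losses} ties={ties} win ratio={ratio:.2f}")
```

With three patients per team there are nine matchups. Ties are counted but do not enter the ratio, which is wins divided by losses.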

The Problem: The "Noise" in the Room
Here is the catch: Not everyone starts the show on equal footing. Some contestants are older, some have worse health, and some have higher "risk scores" before the show even begins.

If Team A happens to get a group of younger, healthier people by pure luck, they will naturally win more matches. This makes the new treatment look too good. Conversely, if Team A gets the "sicker" group, the treatment might look worse than it really is.

In statistics, we call these starting differences covariates. To get a fair result, we need to "adjust" the score to account for these differences. It's like giving a handicap in golf so the game is fair regardless of skill level.

The Old Ways vs. The New Way
Scientists have tried to fix this "noise" before, but the tools were clunky:

The "Weighted" Method: Trying to balance the teams by giving more importance to certain people. It works, but it's hard to explain why a specific factor (like age) mattered.
The "Probability" Method: Good for some things, but it can't calculate the "Win Ratio" directly. It's like trying to measure a circle with a square ruler.
The "Matching" Method: Trying to pair every sick person in Team A with a sick person in Team B. This is messy because you often have leftover people who don't have a match, and you have to throw them out.

The New Solution: The "Ordinal Logistic" Method
The authors of this paper propose a new, elegant tool. Imagine you have a giant spreadsheet where you've lined up every possible matchup between Team A and Team B.

Instead of just counting wins and losses, they use a special mathematical formula (Ordinal Logistic Regression) that asks: "If two people had the exact same starting health, age, and risk factors, who would win?"

This method does three amazing things:

It cleans the noise: It removes the advantage of having a lucky group of healthy people, giving a truer picture of the treatment's power.
It boosts the signal: By removing the "noise" of random differences, the true effect of the treatment becomes clearer. It's like turning up the volume on a radio station while turning down the static. The study becomes more powerful, meaning you need fewer people to prove the treatment works.
It tells a story: Unlike the other methods, this new tool can tell you exactly how much a specific factor (like high blood pressure) changes the odds of winning or losing. It's like a coach saying, "Contestants who start with high blood pressure are at a disadvantage in every matchup, and here is exactly how big that disadvantage is."

The Results
The authors tested this new method using real data from a major heart failure trial (EMPEROR-Preserved) and thousands of computer simulations.

The Good News: Adjusting for these starting differences made the results more precise and powerful. It didn't hurt the analysis even if the factors weren't important.
The Comparison: The new method worked just as well as the old, complicated methods, but it was easier to use and gave clearer answers about why the treatment worked.
The "Quality of Life" Bonus: They also found that if you include a "quality of life" score in the mix, adjusting for the patient's starting quality of life makes the results even stronger, especially if the starting score is a good predictor of how they will feel later.

The Bottom Line
This paper is a guidebook for judges (statisticians) on how to run a fairer, more powerful talent show. By using this new "Ordinal" method, we can stop worrying about whether one team got lucky with healthier contestants. We can focus on the real question: Does the new treatment actually help patients live longer and feel better?

The authors are essentially saying: "Don't just count the wins. Adjust for the starting line, use our new calculator, and you'll get a clearer, more powerful answer."

1. Problem Statement

Hierarchical Composite Endpoints (HCEs), analyzed using the Win Ratio (or related methods like the Finkelstein-Schoenfeld test), are increasingly used in randomized controlled trials (RCTs), particularly in cardiovascular disease. Unlike traditional time-to-first-event analyses, HCEs prioritize clinically more severe outcomes (e.g., death) over less severe ones (e.g., hospitalization) and can incorporate quantitative measures (e.g., quality of life scores).

While covariate adjustment is standard practice in RCTs for conventional outcomes to improve statistical power and precision, methods for adjusting HCEs are underdeveloped. Existing literature lacks:

Systematic comparisons of adjustment methods for the Win Ratio.
Clear guidance on whether adjustment improves power in HCEs.
A method that provides a conditional treatment effect estimate (adjusting for covariates) specifically for the Win Ratio (most existing methods only estimate the Win Odds or Mann-Whitney probability).

2. Methodology

The authors propose a new method and compare it against three existing approaches using both real-world data and extensive simulations.

A. Proposed Method: Ordinal Logistic Regression

The authors introduce an ordinal logistic regression-based approach to estimate a covariate-adjusted Win Ratio.

Mechanism: Instead of modeling patient-level data, the model is applied to pairwise comparisons between intervention and control patients.
Outcome Variable: For each pair $(i, j)$, a response variable $Y_{ij}$ is defined:
- $0$: Loss (Control patient has a better outcome).
- $1$: Tie.
- $2$: Win (Intervention patient has a better outcome).
Covariates: The model uses the within-pair difference in covariate values ( $\Delta C = C_i - C_j$ ) as a predictor.
Model: An ordinal logistic regression is fitted:
$\ln\left(\frac{P(Y \le k | \Delta C)}{P(Y > k | \Delta C)}\right) = \alpha_k - \eta \Delta C$
Output: This yields a conditional Win Ratio (the ratio for two patients with identical covariate values) and interpretable odds ratios for the prognostic covariates. It relies on the proportional odds assumption.
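To make the mechanics concrete, the sketch below builds the pairwise rows $(Y_{ij}, \Delta C)$ and shows how category probabilities follow from the cumulative-logit equation above. It is a hedged illustration rather than the authors' implementation: the function names, the single integer `outcome_rank` standing in for an already-resolved hierarchical comparison, and the threshold values $\alpha_0 = -0.5$, $\alpha_1 = 0.2$, $\eta = 0.3$ are all hypothetical. Reading the conditional Win Ratio as $P(Y=2)/P(Y=0)$ at $\Delta C = 0$ is an interpretation of the description above, not a formula quoted from the paper, and the fitting step itself is omitted.

```python
import math
from typing import List, Tuple


def pairwise_rows(
    trt: List[Tuple[int, float]], ctl: List[Tuple[int, float]]
) -> List[Tuple[int, float]]:
    """Build one (Y_ij, delta_C) row per intervention/control pair.

    Each patient is (outcome_rank, covariate); a higher rank means a better
    hierarchical outcome (a simplification used only for this sketch).
    """
    rows = []
    for rank_i, c_i in trt:
        for rank_j, c_j in ctl:
            if rank_i > rank_j:
                y = 2  # win for the intervention patient
            elif rank_i < rank_j:
                y = 0  # loss
            else:
                y = 1  # tie
            rows.append((y, c_i - c_j))
    return rows


def expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(alpha0: float, alpha1: float, eta: float, delta_c: float):
    """P(loss), P(tie), P(win) implied by logit P(Y <= k) = alpha_k - eta * delta_C."""
    p_le0 = expit(alpha0 - eta * delta_c)
    p_le1 = expit(alpha1 - eta * delta_c)
    return p_le0, p_le1 - p_le0, 1.0 - p_le1


def conditional_win_ratio(alpha0: float, alpha1: float) -> float:
    """P(win) / P(loss) for a pair with identical covariates (delta_C = 0)."""
    p_loss, _, p_win = category_probs(alpha0, alpha1, eta=0.0, delta_c=0.0)
    return p_win / p_loss


# Hypothetical patients: (outcome_rank, covariate value).
rows = pairwise_rows(trt=[(3, 5.0), (1, 7.0)], ctl=[(2, 6.0), (1, 4.0)])
print(rows)

# Hypothetical fitted thresholds, chosen only to show the arithmetic.
print(category_probs(alpha0=-0.5, alpha1=0.2, eta=0.3, delta_c=1.0))
print(conditional_win_ratio(alpha0=-0.5, alpha1=0.2))
```

Because every row already compares an intervention patient with a control patient, the treatment effect sits in the thresholds $\alpha_k$ rather than in a separate treatment coefficient. Rows that share a patient are not independent, so a real analysis would need standard errors that account for this.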

B. Comparison Methods

The study compares the proposed method against:

Probability Index Models: Estimates conditional Win Odds (not Win Ratio) using logistic regression on pairwise outcomes.
Randomization-Based (RB) Method: A marginal estimator that adjusts the unadjusted Mann-Whitney probability by subtracting a term based on the difference in covariate means and their association with the outcome. The authors extend this to estimate an adjusted Win Ratio.
Inverse Probability Weighting (IPW): Uses propensity scores to weight pairwise comparisons, creating a marginal estimator that balances baseline characteristics.
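Since the methods above target different summary measures, it may help to see how the Win Ratio, the Win Odds, and the Mann-Whitney probability relate when computed from the same pairwise counts. The sketch below uses the standard definitions, with ties split evenly between the arms for the Win Odds; the function name and the counts are made up for illustration.

```python
def win_statistics(wins: int, losses: int, ties: int) -> dict:
    """Summarize pairwise comparison counts in three common ways."""
    total = wins + losses + ties
    # Mann-Whitney probability: chance of a win, counting a tie as half a win.
    mann_whitney = (wins + 0.5 * ties) / total
    return {
        "win_ratio": wins / losses,                       # ties ignored
        "win_odds": mann_whitney / (1.0 - mann_whitney),  # ties split evenly
        "mann_whitney": mann_whitney,
    }


# Made-up counts: 100 pairwise comparisons with many ties.
stats = win_statistics(wins=40, losses=30, ties=30)
print(stats)

# Without ties, the Win Odds and the Win Ratio coincide.
no_ties = win_statistics(wins=40, losses=30, ties=0)
print(no_ties)
```

The more ties there are, the further the Win Odds is pulled toward 1 relative to the Win Ratio, which is one reason an estimator of one measure cannot simply be relabeled as the other.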

C. Validation Strategy

Real Data: Applied to the EMPEROR-Preserved trial (Heart Failure), analyzing a hierarchical outcome of CV death followed by HF hospitalization, adjusting for log NT-proBNP and a full risk score.
Simulations: 10,000 replications across two scenarios:
1. Time-to-Event Composite: Death and Hospitalization (based on ATTRIBUTE-CF trial parameters).
2. Mixed Composite: Death, Hospitalization, and a quantitative measure (KCCQ score, based on EMPULSE trial parameters).
Metrics: Evaluated statistical power, bias, standard error, Type-I error control, and effective sample size gains.

3. Key Contributions

Novel Estimator: Development of the first method to provide a covariate-adjusted Win Ratio via ordinal regression, offering a conditional treatment effect estimate analogous to the Cox model.
Methodological Comparison: A comprehensive benchmarking of four adjustment strategies (Ordinal, Probability Index, RB, IPW) specifically for hierarchical outcomes.
Extension of RB Method: Adapting the randomization-based method (previously limited to Win Odds) to estimate an adjusted Win Ratio.
Practical Guidance: Demonstrating that covariate adjustment is "worthwhile" for HCEs, with clear evidence of power gains when adjusting for prognostic covariates and no meaningful loss of power when adjusting for non-prognostic ones.

4. Results

Empirical Application (EMPEROR-Preserved)

Unadjusted Win Ratio: 1.251 ( $p=0.0012$ ).
Adjusted (Log NT-proBNP): 1.260 ( $p<0.001$ ).
Adjusted (Full Risk Score): 1.280 ( $p<0.001$ ).
Observation: Adjustment moved the point estimate further from the null (1.0) and increased statistical significance, mirroring results seen in conventional Cox models. The ordinal model also provided interpretable odds ratios for the covariates (e.g., higher NT-proBNP increased the odds of a "loss").

Simulation Findings

Prognostic Covariates: Adjusting for prognostic variables consistently increased statistical power.
- In the time-to-event scenario, power increased from 86% (unadjusted) to ~90% (adjusted), equivalent to a 15% increase in sample size.
- Gains were comparable to or greater than those seen in conventional Cox models.
Non-Prognostic Covariates: No meaningful loss in power was observed when adjusting for non-prognostic variables.
Quantitative Components: In mixed outcomes (e.g., KCCQ scores), power gains depended heavily on the correlation between baseline and follow-up values.
- High correlation (0.75) led to massive efficiency gains (~80-90% increase in effective sample size).
- Using residuals from a linear regression of follow-up on baseline was slightly more efficient than standard covariate adjustment but altered the estimand (shifting focus to "improvement" rather than absolute status).
Type-I Error: All methods maintained correct Type-I error rates under the null hypothesis.
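The link between a power gain and an "equivalent increase in sample size" can be checked with a back-of-the-envelope calculation. The sketch below assumes a two-sided test at the 5% level and a normal approximation, under which the required sample size scales with $(z_{1-\alpha/2} + z_{\text{power}})^2$. This is a generic textbook approximation, not necessarily how the authors derived their figure, and `sample_size_ratio` is a hypothetical helper name.

```python
from statistics import NormalDist


def sample_size_ratio(power_from: float, power_to: float, alpha: float = 0.05) -> float:
    """How much larger an unadjusted trial must be to reach `power_to`
    instead of `power_from`, under a normal approximation."""
    z = NormalDist().inv_cdf
    z_alpha = z(1.0 - alpha / 2.0)
    return ((z_alpha + z(power_to)) / (z_alpha + z(power_from))) ** 2


ratio_86_to_90 = sample_size_ratio(0.86, 0.90)
print(f"Relative sample size: {ratio_86_to_90:.3f}")
```

Going from exactly 86% to exactly 90% power corresponds to a sample roughly 14% larger under this approximation, which is in the same range as the reported figure given that the adjusted power is quoted only approximately.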

5. Significance and Conclusion

Efficiency: Covariate adjustment is highly beneficial for HCEs, offering power gains similar to those in traditional analyses. This allows for smaller sample sizes or higher power in existing trials.
Interpretability: The proposed ordinal regression method fills a critical gap by allowing researchers to:
1. Estimate the Win Ratio (a clinically intuitive metric) rather than just Win Odds.
2. Obtain a conditional treatment effect (comparing patients who share the same covariate values).
3. Quantify the prognostic value of covariates directly within the model.
Recommendation: The authors recommend the broader adoption of covariate adjustment in RCTs using hierarchical outcomes. They specifically advocate for their ordinal method due to its ease of implementation, interpretability, and ability to provide both marginal and conditional insights, though they acknowledge the trade-off between conditional (model-dependent) and marginal (model-robust) estimands.

The paper concludes that with the availability of worked examples (in R) and planned software updates (Stata), the barrier to implementing these advanced adjustment methods is low, facilitating more precise and informative clinical trial analyses.

Covariate adjustment for hierarchical outcomes and the win ratio: how to do it and is it worthwhile?