Covariate adjustment for hierarchical outcomes and the win ratio: how to do it and is it worthwhile?

This paper introduces and validates an easily implemented ordinal logistic regression method for covariate adjustment in win ratio analyses of hierarchical outcomes. It shows that adjusting for prognostic variables improves statistical power, while adjusting for non-prognostic variables does no harm.

Hazewinkel, A.-D., Gregson, J., Bartlett, J. W., Gasparyan, S. B., Wright, D., Pocock, S.

Published 2026-03-31

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a judge at a talent show. Your job is to decide which team is better: Team A (the new treatment) or Team B (the old standard).

In the past, judges often looked at just one thing: "Who got eliminated first?" If a contestant left the show early, they lost. But this is a bit unfair. Imagine Team A has a contestant who gets a minor cough (a small problem) and leaves early, while Team B has a contestant who gets a life-threatening illness (a huge problem) but stays until the end. If you only look at "who left first," Team A looks worse, even though Team B actually had the much more serious issue.

The "Win Ratio" Solution
To fix this, statisticians invented the Win Ratio. Instead of just looking at who left first, they compare every single person from Team A against every single person from Team B, like a giant round-robin tournament.

They use a "priority list" (a hierarchy):

  1. Death (The worst outcome).
  2. Hospitalization (Bad, but not as bad as death).
  3. Quality of Life Score (How you feel day-to-day).

When comparing two people, the judge looks at the top of the list first.

  • If Person A is alive and Person B died, Person A wins.
  • If both are alive, the judge looks at the next item: Was either one hospitalized? The one who stayed out of the hospital (or stayed out longer) wins.
  • If the pair is still tied on hospitalization, the judge looks at the quality of life score. The one with the better score wins.

The final score is simply: Total Wins for Team A divided by Total Wins for Team B. Tied matchups count for neither side, and a ratio above 1 favors Team A.
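For readers who want to see the tournament spelled out, here is a minimal Python sketch of the scoring described above. The record fields (`death_day`, `hosp_day`, `qol`) and the four example patients are hypothetical, and the sketch assumes every patient was followed for the same length of time. Real analyses must also handle patients with different follow-up, which this toy version ignores.

```python
def later_wins(day_a, day_b):
    """Compare event days; None means the event never happened (best case).
    Returns 1 if a is better, -1 if b is better, 0 if tied."""
    inf = float("inf")
    ta = inf if day_a is None else day_a
    tb = inf if day_b is None else day_b
    return (ta > tb) - (ta < tb)


def compare(a, b):
    """Walk the priority list; the first tier that separates the pair decides."""
    tiers = (
        later_wins(a["death_day"], b["death_day"]),   # 1. death
        later_wins(a["hosp_day"], b["hosp_day"]),     # 2. hospitalization
        (a["qol"] > b["qol"]) - (a["qol"] < b["qol"]),  # 3. quality of life
    )
    for result in tiers:
        if result != 0:
            return result
    return 0  # tied on every tier


def win_ratio(team_a, team_b):
    """Compare every Team A patient with every Team B patient."""
    results = [compare(a, b) for a in team_a for b in team_b]
    wins, losses, ties = results.count(1), results.count(-1), results.count(0)
    ratio = wins / losses if losses else float("inf")
    return wins, losses, ties, ratio


# Hypothetical toy data, for illustration only.
team_a = [
    {"death_day": None, "hosp_day": None, "qol": 70},
    {"death_day": None, "hosp_day": 200, "qol": 55},
]
team_b = [
    {"death_day": 300, "hosp_day": 100, "qol": 40},
    {"death_day": None, "hosp_day": None, "qol": 60},
]

wins, losses, ties, ratio = win_ratio(team_a, team_b)
print(wins, losses, ties, ratio)
```

With these four patients there are four matchups: Team A wins three and loses one, so the win ratio is 3.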

The Problem: The "Noise" in the Room
Here is the catch: Not everyone starts the show on equal footing. Some contestants are older, some have worse health, and some have higher "risk scores" before the show even begins.

If Team A happens to get a group of younger, healthier people by pure luck, they will naturally win more matches. This makes the new treatment look too good. Conversely, if Team A gets the "sicker" group, the treatment might look worse than it really is.

In statistics, we call these starting differences covariates. To get a fair result, we need to "adjust" the score to account for these differences. It's like giving a handicap in golf so the game is fair regardless of skill level.

The Old Ways vs. The New Way
Scientists have tried to fix this "noise" before, but the tools were clunky:

  • The "Weighted" Method: Trying to balance the teams by giving more importance to certain people. It works, but it's hard to explain why a specific factor (like age) mattered.
  • The "Probability" Method: Good for some things, but it can't calculate the "Win Ratio" directly. It's like trying to measure a circle with a square ruler.
  • The "Matching" Method: Trying to pair every sick person in Team A with a sick person in Team B. This is messy because you often have leftover people who don't have a match, and you have to throw them out.

The New Solution: The "Ordinal Logistic" Method
The authors of this paper propose a new, elegant tool. Imagine you have a giant spreadsheet where you've lined up every possible matchup between Team A and Team B.

Instead of just counting wins and losses, they use a special mathematical formula (Ordinal Logistic Regression) that asks: "If two people had the exact same starting health, age, and risk factors, who would win?"

This method does three useful things:

  1. It cleans the noise: It removes the advantage of having a lucky group of healthy people, giving a truer picture of the treatment's power.
  2. It boosts the signal: By removing the "noise" of random differences, the true effect of the treatment becomes clearer. It's like turning up the volume on a radio station while turning down the static. The study becomes more powerful, meaning you need fewer people to detect a treatment effect that is really there.
  3. It tells a story: Unlike the other methods, this new tool can tell you how much a specific factor (like high blood pressure) changes the odds of winning or losing a matchup. It's like a coach saying, "Contestants who start with high blood pressure are less likely to win their matchups, whichever team they are on."
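The "giant spreadsheet" of matchups described above can be sketched in a few lines of Python. This is only an illustration of the data layout, not the authors' implementation: the field names, the 0/1/2 coding of loss/tie/win, and the choice to record each covariate as the Team A value minus the Team B value are all assumptions made for this example, and the summary does not spell out the paper's exact model. As before, it assumes equal follow-up for everyone.

```python
INF = float("inf")


def outcome_key(p):
    """Hierarchy as a tuple: later death, then later hospitalization, then
    higher quality of life is better. None (no event) is treated as infinity.
    Tuples compare left to right, which mirrors the priority list."""
    death = INF if p["death_day"] is None else p["death_day"]
    hosp = INF if p["hosp_day"] is None else p["hosp_day"]
    return (death, hosp, p["qol"])


def build_pairs(team_a, team_b, covariates):
    """One row per matchup: an ordinal outcome plus covariate differences."""
    rows = []
    for a in team_a:
        for b in team_b:
            ka, kb = outcome_key(a), outcome_key(b)
            # Assumed coding: 0 = Team A loses, 1 = tie, 2 = Team A wins
            outcome = 1 + (ka > kb) - (ka < kb)
            row = {"pair": (a["id"], b["id"]), "outcome": outcome}
            for name in covariates:
                # Assumed coding: Team A value minus Team B value
                row["diff_" + name] = a[name] - b[name]
            rows.append(row)
    return rows


# Hypothetical toy data, for illustration only.
team_a = [
    {"id": "A1", "age": 60, "base_qol": 50, "death_day": None, "hosp_day": None, "qol": 70},
    {"id": "A2", "age": 75, "base_qol": 40, "death_day": None, "hosp_day": 200, "qol": 55},
]
team_b = [
    {"id": "B1", "age": 70, "base_qol": 45, "death_day": 300, "hosp_day": 100, "qol": 40},
    {"id": "B2", "age": 65, "base_qol": 55, "death_day": None, "hosp_day": None, "qol": 60},
]

rows = build_pairs(team_a, team_b, ["age", "base_qol"])
for row in rows:
    print(row)
```

A table like this is what an ordinal logistic regression routine would take as input, with `outcome` as the response and the `diff_` columns as predictors. Note that the rows are not independent, because each patient appears in many matchups, so off-the-shelf standard errors from such a fit should not be taken at face value.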

The Results
The authors tested this new method using real data from a major heart failure trial (EMPEROR-Preserved) and thousands of computer simulations.

  • The Good News: Adjusting for these starting differences made the results more precise and powerful. Adjusting for factors that turned out not to predict the outcome did not hurt the analysis.
  • The Comparison: The new method worked just as well as the old, complicated methods, but it was easier to use and gave clearer answers about why the treatment worked.
  • The "Quality of Life" Bonus: They also found that if you include a "quality of life" score in the mix, adjusting for the patient's starting quality of life makes the results even stronger, especially if the starting score is a good predictor of how they will feel later.

The Bottom Line
This paper is a guidebook for judges (statisticians) on how to run a fairer, more powerful talent show. By using this new "Ordinal" method, we can stop worrying about whether one team got lucky with healthier contestants. We can focus on the real question: Does the new treatment actually help patients live longer and feel better?

The authors are essentially saying: "Don't just count the wins. Adjust for the starting line, use our new calculator, and you'll get a clearer, more powerful answer."
