Imagine you are a judge in a cooking competition. You have a famous, award-winning chef (let's call him Chef Reference) who is known for making the world's best soup. Now, a new chef (Chef New) wants to enter the competition. Chef New doesn't need to prove their soup is better than Chef Reference's; they just need to prove it's not significantly worse.
To decide if Chef New passes, the judges need a rule: "How much worse can Chef New's soup be before we say, 'No, this isn't good enough'?" This rule is called the Non-Inferiority Margin.
This paper is about how to set that rule fairly, especially when we are looking at old recipes (historical data) to see how good Chef Reference actually is.
The Big Problem: "What Exactly Are We Measuring?"
In the past, when judges looked at Chef Reference's old recipes, they just looked at the final taste. But a new rulebook (called ICH E9(R1)) says: "Wait a minute! You need to define exactly what you are measuring."
The paper argues that the "rule" (the margin) changes depending on how you define the question. Here are two ways to ask the question, using our soup analogy:
- The "Real World" Question (Treatment Policy): "How does the soup taste if we include everyone who ate it, even if they stopped eating halfway through or added their own salt?"
- Analogy: This measures the soup as it actually exists in the real world, with all its messy interruptions.
- The "Perfect World" Question (Hypothetical): "How would the soup taste if no one ever stopped eating it and no one ever added their own salt?"
- Analogy: This measures the pure potential of the recipe, imagining a perfect scenario where nothing goes wrong.
The Paper's Main Point: The "worse than" rule (the margin) must match the question. If you ask the "Real World" question, your rule must be based on Real World data. If you ask the "Perfect World" question, your rule must be based on Perfect World data. You cannot mix them up.
The Simulation: The "Weight Loss" Race
The authors ran a computer simulation (like a video game) to prove this. They imagined a race where people try to lose weight.
- The Scenario: Some runners stop running because they get tired (an "Intercurrent Event").
- The Result:
- If you measure the race counting everyone, including the people who stopped (Real World), the average weight loss comes out smaller.
- If you measure the race as if no one had ever stopped (Perfect World), the average weight loss comes out larger.
The Lesson: The measured result changes depending on whether you count the people who quit. If you use the wrong version of that number to set your rule for the new chef, you might let in a bad chef or reject a good one.
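The simulation above can be sketched in miniature. The specific numbers below (mean weight loss, dropout rate, how much stopping attenuates the loss) are made-up illustrations, not the paper's actual settings; the point is only how the two estimands diverge once some people stop.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Weight change (%) each person WOULD achieve if they stayed on
# treatment the whole time (illustrative numbers, not the paper's).
on_treatment_loss = rng.normal(-12.0, 4.0, n)

# ~25% of participants discontinue early (an "intercurrent event");
# assume their realized loss is attenuated toward zero by half.
dropped = rng.random(n) < 0.25
realized_loss = np.where(dropped, on_treatment_loss * 0.5, on_treatment_loss)

# Treatment-policy ("Real World") estimand: average over everyone,
# using outcomes as they actually occurred, dropouts included.
treatment_policy = realized_loss.mean()

# Hypothetical ("Perfect World") estimand: average of the outcomes
# that would have occurred had no one discontinued.
hypothetical = on_treatment_loss.mean()

print(f"treatment-policy mean: {treatment_policy:.1f}%")
print(f"hypothetical mean:     {hypothetical:.1f}%")
# Counting the people who stopped shrinks the apparent weight loss,
# so the two estimands give two different "reference" numbers.
```

The same historical trial therefore yields two legitimate effect estimates, and which one you should borrow depends entirely on which question the new trial asks.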
Example 1: The "STEP" Trials (The Clear Case)
The authors looked at real weight-loss trials using a drug called Semaglutide (Chef Reference).
- These trials were very modern and followed the new rulebook. They clearly stated: "We measured two things: the Real World result and the Perfect World result."
- The Finding: The "Perfect World" result showed a huge weight loss (e.g., -12.6%). The "Real World" result showed less weight loss (e.g., -10.9%) because people stopped taking the drug or used other diets.
- The Dilemma: If the new trial wants to ask the "Real World" question, but they use the "Perfect World" number to set their rule, they will set the bar too high. They might reject a drug that is actually good enough for real life.
- The Solution: You must pick the number that matches your specific question.
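One common way to turn a historical estimate into a rule is the "fixed-margin" approach: allow the new drug to give up only some fraction of the reference effect. The 50% preservation fraction below is a widely used convention, assumed here for illustration (not a value from the paper), applied to the two estimand-specific estimates quoted above.

```python
# Fixed-margin sketch: the new drug must preserve at least 50% of the
# historical reference effect (the 50% fraction is an assumed,
# conventional choice, not taken from the paper).
hypothetical_effect = -12.6      # "Perfect World" estimate (% weight loss)
treatment_policy_effect = -10.9  # "Real World" estimate (% weight loss)
preserved_fraction = 0.5

# The margin is the amount of effect the new drug is allowed to lose.
margin_if_hypothetical = (1 - preserved_fraction) * abs(hypothetical_effect)
margin_if_treatment_policy = (1 - preserved_fraction) * abs(treatment_policy_effect)

print(f"margin from Perfect World data: {margin_if_hypothetical:.2f} points")
print(f"margin from Real World data:    {margin_if_treatment_policy:.2f} points")
```

The two estimands justify different margins (6.30 vs 5.45 percentage points here). Whether borrowing the wrong one makes the test too strict or too loose depends on how the margin enters the hypothesis, but either way the borrowed number answers a different question than the one the new trial is asking.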
Example 2: The "SCALE" Trials (The Messy Case)
Then they looked at older trials using a drug called Liraglutide. These trials were done before the new rulebook existed.
- The Problem: The old reports didn't say clearly, "We measured the Real World" or "We measured the Perfect World." They just gave a number.
- The Detective Work: The authors had to act like detectives. They looked at the fine print, the flow charts of who dropped out, and the statistical methods used.
- Clue: Did they count the data from people who quit? (If yes, it's Real World).
- Clue: Did they try to guess what would have happened if people didn't quit? (If yes, it's Perfect World).
- The Challenge: It's like trying to guess the recipe of a soup from 20 years ago when the chef only wrote "It tasted good." You have to make an educated guess. The paper warns that if you guess wrong, your rule (the margin) will be wrong, and the whole trial could fail.
Why Does This Matter?
If you set the rule (the margin) based on the wrong question:
- Too Strict: You might reject a new medicine that is actually safe and effective for real people, just because it didn't match a "Perfect World" fantasy.
- Too Loose: You might approve a medicine that is actually terrible in the real world, because you compared it to a "Perfect World" number that was too high.
The Takeaway
The paper concludes with a simple message for scientists and regulators:
"Don't just look at the number; look at the story behind the number."
Before you decide if a new drug is "good enough," you must agree on exactly what "good enough" means.
- Are we talking about the messy, real world?
- Or are we talking about a perfect, hypothetical world?
Once you agree on the story, you can pick the right historical data to set the rule. If you mix the stories, the rule breaks, and the whole competition becomes unfair.
In short: The margin (the rule) is not a fixed number like "10%." It is a flexible target that changes shape depending on how you ask the question. You must match your question to your data, or you'll be measuring the wrong thing.