This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
The Big Picture: A Race with a "False Start" Rule
Imagine you are watching a marathon. In this race, the goal is to see how long runners can stay in the race without tripping, getting injured, or quitting. This is called Event-Free Survival (EFS).
In the world of leukemia (specifically Acute Myeloid Leukemia), doctors have a new rule from the "Referees" (the FDA and European LeukemiaNet). They say: "If a runner shows signs of giving up early, we shouldn't wait until they actually stop running to mark it. Instead, let's mark it as if they tripped right at the starting line (Day 1)."
This rule makes sense medically because if a treatment isn't working, the patient is effectively a "failure" from the start, even if the doctor doesn't confirm it until a week later.
The Problem:
The authors of this paper realized that the standard way of calculating race results (the Kaplan-Meier estimator) gets confused by this new rule.
Think of it like this:
- The Standard Method: If a runner drops out of the race before the official start line is even crossed (because they got sick or left the stadium early), the standard method assumes they never ran at all. It ignores them.
- The New Rule: We want to count them as a "Day 1 failure."
- The Glitch: If you just take the standard method and apply the new rule, you end up underestimating how many people actually failed on Day 1. You are missing the people who dropped out before anyone could check their status. It's like a race where you only count the people who tripped after the starting gun, ignoring the ones who tripped while tying their shoes.
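The glitch can be made concrete with a tiny simulation (all numbers below are made up for illustration, not taken from the paper). The key mechanism: failures are backdated to Day 1, but they are only *observed* at a later assessment visit, so patients who drop out before that visit look event-free on Day 1 to a naive Kaplan-Meier-style calculation.

```python
# Hypothetical simulation of the "Day 1 backdating" glitch. A failure is
# only recorded if the patient is still in the study at the assessment
# visit; early dropouts sit in the Day 1 risk set looking event-free,
# so the naive Day 1 failure estimate comes out too low.
import random

random.seed(42)

TRUE_FAILURE_PROB = 0.20   # true share of "Day 1 failures" (non-responders)
DROPOUT_PROB = 0.30        # chance of leaving the study before the visit
N = 100_000

observed_day1_events = 0
at_risk_on_day1 = 0
for _ in range(N):
    is_failure = random.random() < TRUE_FAILURE_PROB
    dropped_out_early = random.random() < DROPOUT_PROB
    # Everyone censored after Day 1 still counts in the Day 1 risk set.
    at_risk_on_day1 += 1
    # The failure is recorded (and backdated to Day 1) only if the patient
    # was still in the study when response was assessed.
    if is_failure and not dropped_out_early:
        observed_day1_events += 1

naive_day1_failure = observed_day1_events / at_risk_on_day1
print(f"true Day 1 failure prob: {TRUE_FAILURE_PROB:.3f}")
print(f"naive estimate:          {naive_day1_failure:.3f}")  # ~0.14, too low
```

With 30% early dropout, the naive estimate lands near 0.20 × 0.70 = 0.14 instead of the true 0.20: exactly the "missing the people who tripped while tying their shoes" effect.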
The Solution: A New Way to Count
The authors (Judith, Maral, Kaya, Axel, and Hartmut) built a new "scorekeeper" system to fix this.
1. The "Two Lanes" System (Competing Risks)
Instead of just looking at one big pile of "failures," they split the failures into two distinct lanes:
- Lane 1 (The Day 1 Failures): People who didn't respond to treatment.
- Lane 2 (The Later Failures): People who responded well at first but then relapsed or died later.
They used a mathematical tool called the Aalen-Johansen estimator. Imagine it as a smart camera that tracks every runner individually. When a runner leaves the stadium early (censoring), the camera doesn't pretend they finished cleanly; it uses the failure rates observed among the runners who stayed to estimate the chance that the missing runner would have ended up in Lane 1 or Lane 2. This gives a fair, unbiased count of Day 1 failures.
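A minimal pure-Python sketch of the textbook Aalen-Johansen cumulative incidence estimator for two competing event types, on made-up toy data (the paper's full multistate construction is more involved than this basic form):

```python
# Aalen-Johansen cumulative incidence for competing risks, basic form.
# Cause codes: 1 = "Day 1 failure", 2 = "later relapse/death", 0 = censored.
def cumulative_incidence(times, causes, cause_of_interest, t):
    """CIF_k(t) = sum over event times s <= t of S(s-) * d_k(s) / n(s)."""
    event_times = sorted(set(tt for tt, c in zip(times, causes) if c != 0))
    surv = 1.0   # overall event-free survival just before each event time
    cif = 0.0
    for s in event_times:
        if s > t:
            break
        n_at_risk = sum(1 for tt in times if tt >= s)
        d_all = sum(1 for tt, c in zip(times, causes) if tt == s and c != 0)
        d_k = sum(1 for tt, c in zip(times, causes)
                  if tt == s and c == cause_of_interest)
        cif += surv * d_k / n_at_risk   # mass falling into lane k at time s
        surv *= 1.0 - d_all / n_at_risk
    return cif

# Toy data (days): all cause-1 events sit at Day 1 by construction.
times  = [1, 1, 1, 40, 55, 60, 70, 80, 90, 100]
causes = [1, 1, 1,  2,  0,  2,  0,  2,  0,   0]
print(cumulative_incidence(times, causes, 1, 1))   # 0.3 (3 of 10 on Day 1)
```

Unlike a naive "1 minus Kaplan-Meier" per cause, the two lane-specific curves plus the event-free probability always add up to exactly 1.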
2. The "Cure" Analogy (Mixture Cure Models)
The paper also uses a concept called a "Cure Model." This sounds like a magic pill, but it's actually a statistical trick.
Imagine the patients are divided into two invisible groups:
- Group A (The "Cured"): These are the people who respond to treatment and never relapse. A cure model puts a lump of probability on "never fails."
- Group B (The "Not Cured"): These are the people who will eventually fail.
The authors realized that "failing on Day 1" is the mirror image of "being cured": a cure model puts a lump of probability at never failing, while the new EFS rule puts a lump at failing immediately on Day 1. Mathematically, both are a mixture of a point mass and a smooth survival curve, so the same machinery applies. If you fail on Day 1, you are "done" with the race immediately. If you don't fail on Day 1, you are in the "race" for the long haul.
By using this model, they can ask two separate questions:
- Did the treatment help people avoid failing immediately? (The Day 1 drop).
- Did the treatment help people stay in remission longer? (The long-term survival).
This is important because a drug might be great at keeping people alive long-term but terrible at getting them into remission immediately. Standard methods often blur these two effects together.
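The two-question split can be written as a simple mixture: past Day 1, event-free survival factors into (1 − p), the probability of avoiding immediate failure, times the survival curve of the responders. A hedged sketch with illustrative numbers (a 10% Day 1 failure probability and a constant hazard for responders; neither is an estimate from the paper):

```python
import math

def efs(t, p_day1=0.10, hazard=0.01):
    """Mixture-style EFS: a point mass p_day1 at Day 1, then exponential
    survival for responders. Illustrative numbers, not the paper's.
    EFS(t) = (1 - p_day1) * exp(-hazard * t) for t >= 1 (days)."""
    if t < 1:
        return 1.0
    return (1.0 - p_day1) * math.exp(-hazard * t)

# The two effects separate cleanly: a drug could lower p_day1 (more
# immediate responders) without changing the responders' hazard, or
# vice versa -- which a single blended EFS curve would blur together.
print(round(efs(0), 3))   # 1.0 (before Day 1 nothing has happened)
print(round(efs(1), 3))   # drops by the Day 1 failure mass, to ~0.891
```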
What Did They Find?
They tested their new method on real data from a major German leukemia study (AMLSG 09-09).
- The Interim Check (Mid-Race): When they looked at the data halfway through the study, there was a lot of "missing data" (people who left the study early). The old method (Kaplan-Meier) said the Day 1 failure rate was about 9-10%. Their new method said it was actually 10-11%.
- Why the difference? The old method missed the people who left early. The new method correctly estimated that those missing people likely would have failed on Day 1.
- The Final Check (Finish Line): By the end of the study, almost everyone had been followed long enough. The "missing data" problem disappeared. Both the old method and the new method gave almost the exact same answer.
The Takeaway:
If a study has very few people dropping out early, the old method is fine. But if many people drop out before the "Day 1" check is complete, the old method is lying to you by making the treatment look slightly better than it is. The new method fixes this lie.
Why Should You Care?
This paper is like a manual for fixing a broken ruler.
- For Doctors: It ensures they aren't underestimating how many patients fail a treatment immediately. This helps in making better decisions about which drugs to use.
- For Patients: It means the statistics used to approve new drugs are more accurate. If a drug has a high "Day 1 failure" rate, the new method catches it, even if the data is messy.
- For Science: It shows that when rules change (like the new FDA rule), our math tools need to change with them, or we get the wrong results.
In short: Don't just shift the data and hope for the best. Use the right math to count the failures correctly, especially when the race is messy.