Robust evaluation of treatment effects in longitudinal studies with truncation by death or other intercurrent events

Imagine you are organizing a race to see which of two running shoes (Shoe A and Shoe B) helps runners go fastest. You have a huge group of people, and you randomly give half of them Shoe A and the other half Shoe B.

But here's the problem: Life happens.

Some runners get a flat tire (they need a "rescue" shoe). Some runners get a sudden cramp and have to stop running entirely (they "drop out"). And tragically, some runners get sick and have to leave the race forever (they "die").

In a standard race analysis, if you just look at who finished the race, you might get the wrong answer.

If Shoe A makes people run so fast they get exhausted and quit, but Shoe B makes them run slowly and finish, a standard analysis might say "Shoe B is better!" even if Shoe A was actually the faster shoe.
If the race stops because someone died, you can't measure their speed after that point. If you just ignore the dead runners, you are only looking at the "survivors," which might be a very different group of people than the ones who died.

This paper introduces a new, clever way to judge the shoes called PLOT (Pairwise Last Observation Time).

The Old Ways vs. The New Way

1. The "Last Look" Method (LOCF):
Imagine you look at the runners at the very end of the race. If a runner quit early, you just take their last known speed and pretend they kept running at that speed until the finish line.

The Flaw: This is like guessing a runner's speed at mile 26 based on how fast they were at mile 1. It's a guess, and it can be very wrong.

2. The "Survivors Only" Method:
You only look at the people who finished the race.

The Flaw: This is biased. Maybe Shoe A is so tough that only the strongest, most elite runners can finish in it, while the weaker ones quit. If you only look at the finishers, you might think Shoe A is amazing, but it's actually just filtering out the weak runners.

3. The "What If" Method (Hypothetical):
You try to imagine a magical world where no one ever got a flat tire or died, and calculate what would have happened.

The Flaw: This requires making up a lot of rules about a world that doesn't exist. If your rules are slightly wrong, your whole answer is wrong.

The New "PLOT" Method: The "Handshake" Analogy

The authors propose a method called PLOT. Instead of looking at the end of the race or guessing the future, they use a "pairing" strategy.

Imagine you take two runners: one wearing Shoe A and one wearing Shoe B. You pair them up based on how fit they were at the start (their "baseline").

Now, you watch them run together.

Runner A gets a flat tire at mile 10.
Runner B gets a cramp at mile 12.

The race for this specific pair stops at mile 10. Why? Because that is the last moment both of them were still running freely.

You compare their speeds exactly at mile 10. You don't guess what they would have done at mile 20. You don't ignore Runner A because they stopped. You just say, "Okay, at the moment the first person in this pair stopped, here is how they compared."

You do this for thousands of pairs. Some pairs stop at mile 2, some at mile 15. You average all these "last moments" together.

Why is this better?

Fairness: You are comparing the shoes at the exact same point in time for both runners. You aren't giving Shoe A an unfair advantage because its runners lasted longer.
No Magic: You aren't guessing what would happen if they didn't get sick. You are using real data from the moment they were actually running.
Robustness: Even if the "flat tires" happen for reasons nobody measured (like hidden health issues), this method is much harder to fool. Under conditions the paper spells out, it still reports "no difference" when the shoes truly perform the same.

The "Synchronous" Concept

Think of it like a dance.

In the old methods, one dancer might keep dancing after the music stops, while the other stops early. You try to guess how the early stopper would have danced.
In the PLOT method, you stop the music the moment either dancer stops. You look at how they were dancing together right up until that moment. It's a "synchronous" snapshot.

The Real-World Test (The DEVOTE Trial)

The authors tested this idea on a real medical study about diabetes drugs. In this study, some patients died or had to stop taking the drug.

Old methods gave confusing or very uncertain results.
The PLOT method gave a clear, confident answer: One drug (Insulin Degludec) caused significantly fewer dangerous low-blood-sugar events than the other, even when accounting for people dying or dropping out.

The Bottom Line

When a study is ruined by people dropping out, getting sick, or dying, don't just guess what would have happened, and don't just ignore the people who left.

Instead, pair people up, watch them run until the first one stops, and compare them right there. It's a simpler, fairer, and more honest way to see which treatment really works.

Here is a detailed technical summary of the paper "Robust Evaluation of Treatment Effects in Longitudinal Studies with Truncation by Death or Other Intercurrent Events" by Baklicharov, Van Lancker, and Vansteelandt.

1. Problem Statement

Longitudinal randomized clinical trials (RCTs) are frequently complicated by intercurrent events (ICEs) such as treatment switching, rescue medication, dropout, and truncation by death. These events occur after treatment initiation but before the final outcome measurement, which makes standard Intention-to-Treat (ITT) analyses difficult to carry out and interpret.

Existing causal inference frameworks for handling ICEs face significant limitations:

Hypothetical Estimands: These quantify effects under a counterfactual protocol where ICEs are prevented. They rely on strong, often unverifiable structural assumptions and are sensitive to positivity violations (e.g., when rescue medication is deterministic under certain conditions).
Principal Stratum (PS) Estimands: Examples include the Survivor Average Causal Effect (SACE). While they avoid "what-if" protocol reasoning, they target unidentifiable subpopulations (e.g., "always survivors") and require strong assumptions (e.g., monotonicity or specific independence assumptions) that are difficult to verify and often implausible.
Composite/LOCF Approaches: Last Observation Carried Forward (LOCF) or composite endpoints (e.g., assigning a worst value after death) often introduce selection bias or obscure whether the treatment affects the outcome, the ICE, or both.
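
To make the LOCF criticism concrete, here is a minimal Python sketch of the imputation step. The function name `locf` and the toy values are illustrative assumptions, not taken from the paper.

```python
from typing import List, Optional


def locf(visits: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing visit (None) with the most recent observed value."""
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)  # stays None until a first value has been observed
    return filled


# Hypothetical patient: outcome recorded at visits 0-2, then an ICE occurs.
observed = [7.9, 7.4, 7.1, None, None]
completed = locf(observed)
```

Every post-ICE entry in `completed` is a copy of the last pre-ICE value, so a final-visit comparison built on it implicitly assumes the outcome stopped changing at the ICE.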

The core challenge is to develop a method that evaluates treatment efficacy robustly in the presence of ICEs without relying on unverifiable structural assumptions or extrapolating to unobserved post-ICE outcomes.

2. Methodology: Pairwise Last Observation Time (PLOT)

The authors propose a novel framework based on Pairwise Last Observation Time (PLOT) estimands.

Core Concept

Instead of comparing outcomes at a fixed time point or extrapolating to a hypothetical world, PLOT compares treated and untreated individuals at the same time point: specifically, the last time point at which both individuals in a pair are still ICE-free, capped at the time point of interest $t$.

Pairwise Matching: The method considers pairs of individuals $(i, j)$ where $i$ is treated and $j$ is control.
Synchronous Comparison: The comparison occurs at time $M_{ij} = \min(T_i, T_j, t)$, where $T_i$ and $T_j$ are the two individuals' times to first ICE and $t$ is the time point of interest. This ensures the pair is compared at a time when both were still ICE-free.
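
The pairing rule can be sketched in a few lines of Python. This is a naive all-pairs average meant only to illustrate the synchronous comparison, not the authors' efficient estimator. The `Subject` class, the toy numbers, and the discrete-visit convention (T is stored as the last visit index at which the subject is still ICE-free, so the outcome at the comparison time is observed for both members of a pair) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subject:
    last_ice_free_visit: int  # T under the assumed discrete-visit convention
    outcomes: List[float]     # outcomes[k] = Y at visit k, k = 0..last_ice_free_visit


def comparison_time(i: Subject, j: Subject, t: int) -> int:
    """M_ij = min(T_i, T_j, t): last visit at which both are ICE-free, capped at t."""
    return min(i.last_ice_free_visit, j.last_ice_free_visit, t)


def pairwise_contrast(treated: List[Subject], controls: List[Subject], t: int) -> float:
    """Average of Y_i(M_ij) - Y_j(M_ij) over all treated-control pairs."""
    total = 0.0
    for i in treated:
        for j in controls:
            m = comparison_time(i, j, t)
            total += i.outcomes[m] - j.outcomes[m]
    return total / (len(treated) * len(controls))


# Hypothetical toy data with visits 0..3.
treated = [Subject(3, [5.0, 4.0, 3.0, 2.0]), Subject(1, [5.0, 4.5])]
controls = [Subject(3, [5.0, 5.0, 5.0, 5.0]), Subject(2, [5.0, 4.8, 4.6])]
plot_estimate = pairwise_contrast(treated, controls, t=3)
```

In this toy data the four pairs are compared at visits 3, 2, 1, and 1, giving an average contrast of about -1.35. No outcome after an ICE is ever read or imputed.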

Estimands

The paper defines two main types of estimands:

PLOT (Unconditional):
$\Psi_t = E\left[ Y^1(\min(T^1, T^{*0}, t)) - Y^{*0}(\min(T^1, T^{*0}, t)) \right]$
This contrasts the potential outcomes of a treated individual and an independent control individual (the starred quantities) at the time the first of the two experiences an ICE, or at time $t$ if neither has experienced one by then.
CPLOT (Conditional):
$\Phi_t = E\left[ E\left[ Y^1(\min(T^1, T^{*0}, t)) - Y^{*0}(\min(T^1, T^{*0}, t)) \mid L = L^* \right] \right]$
This conditions on baseline covariates $L$. By matching on $L$ (the two members of a pair share the same covariate value, $L = L^*$), the method accounts for prognostic factors influencing both the outcome and the timing of ICEs. The authors argue CPLOT is preferable as it "truncates" fewer measurements and reduces bias from unmeasured confounding of ICE timing.
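
For a single discrete baseline covariate, the conditional version can be sketched by pairing only within strata and then averaging the stratum-specific contrasts. The sketch below is a plug-in illustration under several assumptions that are not from the paper: the record layout, the stratum labels, the convention that T is the last ICE-free visit index, and the choice to weight each stratum by its share of the whole sample as a reading of the outer expectation over $L$.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (arm, stratum of L, last ICE-free visit, outcomes by visit)
Record = Tuple[int, str, int, List[float]]


def stratified_pairwise_contrast(data: List[Record], t: int) -> float:
    """Within-stratum pairwise contrasts, averaged over the sample distribution of L."""
    by_stratum: Dict[str, Dict[int, List[Record]]] = defaultdict(lambda: {0: [], 1: []})
    for rec in data:
        by_stratum[rec[1]][rec[0]].append(rec)

    estimate = 0.0
    for stratum, arms in by_stratum.items():
        treated, controls = arms[1], arms[0]
        if not treated or not controls:
            raise ValueError(f"stratum {stratum!r} has no subjects in one arm")
        diffs = []
        for _, _, t_i, y_i in treated:
            for _, _, t_j, y_j in controls:
                m = min(t_i, t_j, t)  # synchronous comparison time for this pair
                diffs.append(y_i[m] - y_j[m])
        within = sum(diffs) / len(diffs)
        weight = (len(treated) + len(controls)) / len(data)  # sample share of stratum
        estimate += weight * within
    return estimate


# Hypothetical toy data: "frail" subjects have early ICEs, "fit" subjects late ones.
data: List[Record] = [
    (1, "frail", 1, [5.0, 4.0]),
    (0, "frail", 1, [5.0, 5.0]),
    (1, "fit", 3, [5.0, 4.0, 3.0, 2.0]),
    (1, "fit", 3, [5.0, 4.0, 3.0, 2.0]),
    (0, "fit", 3, [5.0, 5.0, 5.0, 5.0]),
    (0, "fit", 2, [5.0, 5.0, 5.0]),
]
cplot_estimate = stratified_pairwise_contrast(data, t=3)
```

Because subjects in the same stratum tend to have similar ICE times, a "fit" subject is never cut short by being paired with a "frail" one, which is one way to read the claim that conditioning truncates fewer measurements.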

Estimation Strategy

The authors develop asymptotically efficient, model-free estimators using Double/Debiased Machine Learning (DML):

Nuisance Parameters: The method requires estimating:
- Conditional survival probabilities: $p_{a,s}(L) = P(T > s \mid A=a, L)$.
- Conditional mean outcomes: $\mu_{a,s,u}(L) = E[Y(s) \mid A=a, T > u, L]$.
Algorithms:
- Survival probabilities are estimated using Survival Random Forests.
- Outcome regressions are estimated using Super Learner (an ensemble of GLMs, random forests, and GAMs).
Cross-Fitting: To avoid overfitting and ensure valid inference when using flexible machine learning, the authors employ $K$-fold cross-fitting.
Influence Functions: The estimators are constructed using Efficient Influence Functions (EIF), allowing for $\sqrt{n}$-consistency and asymptotic normality even when nuisance parameters are estimated at slower rates (provided specific rate conditions are met).
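
The cross-fitting step can be illustrated with a standard-library Python sketch for the survival nuisance $p_{a,s}(L)$. To stay self-contained it swaps the paper's survival random forests for simple empirical frequencies within each (arm, covariate) cell, so only the fold structure mirrors the description above. The function names, the row layout, and the simulated data are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str, int]  # hypothetical layout: (arm A, discrete covariate L, ICE time T)


def fit_survival(train: List[Row], s: int) -> Dict[Tuple[int, str], float]:
    """Empirical P(T > s | A=a, L=l) for each (a, l) cell seen in the training rows."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [number with T > s, cell size]
    for a, l, time in train:
        counts[(a, l)][0] += int(time > s)
        counts[(a, l)][1] += 1
    return {cell: k / n for cell, (k, n) in counts.items()}


def cross_fit_survival(rows: List[Row], s: int, n_folds: int = 5, seed: int = 0) -> List[float]:
    """Out-of-fold predictions: each row is scored by a model fitted without its fold."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    preds = [float("nan")] * len(rows)
    for held_out in folds:
        held = set(held_out)
        train = [rows[i] for i in order if i not in held]
        model = fit_survival(train, s)
        for i in held_out:
            a, l, _ = rows[i]
            preds[i] = model.get((a, l), float("nan"))  # nan if the cell was unseen in training
    return preds


# Simulated toy data (not from the paper).
rng = random.Random(1)
rows = [(rng.randint(0, 1), rng.choice(["low", "high"]), rng.randint(1, 6)) for _ in range(200)]
preds = cross_fit_survival(rows, s=3)
```

Each entry of `preds` comes from a model that never saw that row, which is the property cross-fitting relies on when the nuisance learner is flexible enough to overfit.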

3. Key Contributions

Novel Estimand Definition: Introduction of PLOT and CPLOT estimands that anchor comparisons to the observed data distribution rather than hypothetical scenarios. This avoids the need for extrapolation to unobserved post-ICE outcomes.
Robustness to Positivity Violations: Unlike Inverse Probability of Censoring Weighting (IPCW) or SACE estimators, PLOT/CPLOT do not require strict positivity (i.e., non-zero probability of remaining ICE-free for all covariate patterns). They remain valid even when ICEs are nearly deterministic under specific conditions.
Theoretical Guarantees:
- The paper proves that under randomization and consistency, these estimands are identified without structural assumptions.
- It establishes conditions under which the estimands equal zero under the null hypothesis of no treatment effect. Specifically, it shows that CPLOT is robust to unmeasured confounding of the ICE timing unless that unmeasured factor also modifies the additive time effect on the outcome.
- It derives the EIF and proves asymptotic efficiency under the semiparametric model.
Interpretation Framework: The authors provide a method to recover other estimands (e.g., SACE or hypothetical effects in the absence of ICEs) from the CPLOT estimand by correcting for the "dilution" caused by early ICEs, assuming a linear or time-specific effect accumulation model.

4. Results

Simulation Studies

The authors conducted extensive simulations comparing PLOT/CPLOT against SACE, IPCW, LOCF, and naive survivor analyses across four settings:

Setting 1 (Standard): PLOT/CPLOT showed negligible bias and nominal coverage (95%). IPCW and LOCF showed moderate bias and undercoverage. SACE was unbiased but had extremely high variance.
Setting 2 (Treatment Switching/Positivity Violation): In a scenario with time-varying confounding and near-positivity violations (where SACE and IPCW fail), PLOT/CPLOT maintained unbiasedness and correct coverage. Other methods exhibited severe bias (e.g., IPCW coverage dropped to 54.5%).
Settings 3 and 4 (Binary/Count Outcomes): CPLOT with cross-fitting consistently outperformed competitors in terms of bias, variance, and coverage.

Application: DEVOTE Trial

The method was applied to the DEVOTE trial (Type 2 Diabetes), analyzing severe hypoglycemia with truncation by death.

Findings: The CPLOT estimator indicated a significant reduction in severe hypoglycemic events for Insulin Degludec (IDeg OD) compared to Insulin Glargine (IGlar OD) (additive contrast: $-0.043$, $p=0.0007$).
Comparison:
- SACE: The standard SACE estimator yielded a similar point estimate ($-0.049$) but with an uninformative, extremely wide confidence interval ($-4.60$, $4.50$).
- CPLOT: Provided a much narrower, informative confidence interval ($-0.085$, $-0.012$).
- LOCF/Survivors: Naive methods yielded similar point estimates but relied on the strong (and likely false) assumption that survival is independent of treatment assignment.
Conclusion: The CPLOT approach offered the precision of a standard analysis with the robustness of a causal method, avoiding the instability of SACE in the presence of death.

5. Significance and Implications

Regulatory Relevance: The method offers a robust alternative for regulatory decision-making where hypothetical estimands are often criticized for relying on unverifiable assumptions. It provides a "data-driven" evaluation of treatment efficacy that stays close to the observed reality.
Handling Truncation by Death: It addresses the "truncation by death" problem without requiring the identification of the "always survivor" stratum, which is often impossible in practice.
Flexibility: By utilizing modern machine learning (Super Learner, Random Forests) within a DML framework, the method adapts to complex, high-dimensional baseline covariates and non-linear relationships without requiring correct parametric model specification.
Generalizability: While focused on RCTs, the framework extends naturally to observational studies provided baseline covariates sufficiently adjust for confounding.

In summary, this paper presents a rigorous, robust, and efficient methodology for evaluating longitudinal treatment effects in the presence of intercurrent events, overcoming the sensitivity and instability issues that plague existing causal inference frameworks.
