Novel g-computation algorithms for time-varying actions with recurrent and semi-competing events

Imagine you are trying to figure out how a specific habit—like smoking—changes a person's life over several decades. You want to know: "If nobody had ever smoked, would fewer people have high blood pressure? Would fewer people have died?"

This sounds simple, but in the real world, life is messy. People get sick, they stop smoking, they start again, and sadly, some people pass away before the study ends. This creates a statistical nightmare for researchers.

Here is a simple breakdown of what this paper does, using some everyday analogies.

The Problem: The "Dead End" and the "Moving Target"

The researchers are dealing with two specific problems that make standard math fail:

The "Dead End" (Semi-Competing Events):
Imagine a video game where your character has two ways to lose: getting a "High Blood Pressure" badge or getting the "Game Over" (Death) screen.
- If you get the "High Blood Pressure" badge, you can still keep playing. You might get better, or you might get worse.
- But if you hit "Game Over" (Death), the game stops instantly. You can't get the "High Blood Pressure" badge after you die.
- The Trap: Old methods often just delete anyone who hits "Game Over" from the data. But this is cheating! It's like saying, "We only counted the people who survived, so our game is safer than it really is." If you ignore the people who died, you might think smoking isn't that bad because the people who died (and would have had high blood pressure) are gone from the stats.
The "Moving Target" (Time-Varying Confounding):
Imagine you are tracking a runner.
- At the start, they smoke.
- Because they smoke, they get tired (a change in their body).
- Because they are tired, they decide to stop smoking.
- Because they stopped smoking, they start running faster.
- The Trap: The runner's current state (tiredness) is caused by their past action (smoking), but it also changes their future action (quitting). Standard math gets confused here because the "cause" and the "effect" are tangled up like a ball of yarn.

The Solution: A New "Time-Travel" Algorithm

The authors, led by Alena Sorensen D'Alessio, developed two new algorithms built on an established technique called g-computation to solve this. Think of these algorithms as a Time-Travel Simulator.

Instead of just looking at the real data, the computer creates a "Parallel Universe" simulation:

The Setup: The computer takes real people from the study (like the Add Health study, which followed thousands of Americans from their teens to their 50s).
The "What If" Scenario: The computer says, "Okay, let's pretend nobody in this group ever smoked."
The Simulation Loop: The computer runs the timeline forward, year by year.
- It asks: "If this person didn't smoke, would they still be alive? Would they have high blood pressure?"
- It uses complex math to guess what would happen to their health, their weight, and their habits based on what actually happened to similar people in the real world.
- Crucially: If a person "dies" in the simulation, the computer remembers that. It doesn't delete them. It counts them as a death in the final tally, and from that point on they can no longer gain or lose high blood pressure. This fixes the "Dead End" problem.
The Comparison: The computer runs the simulation twice:
- Universe A: The comparison scenario (for example, smoking as it actually happened in the real world).
- Universe B: No one smokes (the intervention).
- It then compares the two universes to see the difference in death rates and high blood pressure rates.

The Results: What Did They Find?

The researchers tested their new "Time-Travel Simulator" in two ways:

The Test Drive (Simulation): They created fake data where they knew the exact answer. They tried their new method against old methods.
- Result: The old methods were wrong (biased). They either ignored the deaths or got confused by the changing habits. The new method was accurate, like a GPS that actually knows the traffic.
The Real World Test (Smoking & Blood Pressure): They applied their method to real data from the "Add Health" study (people aged 18 to 51).
- The Question: What if we prevented all smoking from young adulthood to middle age?
- The Finding: If no one had smoked, the study predicts that:
  - Fewer people would have died (about 1.6 percentage points fewer).
  - Fewer people would have high blood pressure (about 1.1 percentage points fewer).
- Why it matters: The new method showed that the risk of death is real and significant. Old methods that just "deleted" the dead people would have missed this connection entirely.

The Big Picture

Think of this paper as upgrading the map for epidemiologists (the mapmakers of public health).

Old Map: "If you die, you disappear from the map. If your habits change, we get confused."
New Map: "We track you even if you die, and we understand that your habits change because of your health, and your health changes because of your habits."

The authors are saying: "As we study older and older groups of people, more of them will pass away during the study. If we don't use these new tools, we will get the wrong answers about how to keep people healthy. These new algorithms are the key to understanding the true impact of things like smoking, diet, or exercise over a lifetime."

They even made the code (the "blueprints" for the simulator) available for free so other scientists can use it to solve similar puzzles in their own research.

Here is a detailed technical summary of the paper "Novel g-computation algorithms for time-varying actions with recurrent and semi-competing events."

1. Problem Statement

The paper addresses a critical gap in causal inference within epidemiology, specifically regarding longitudinal studies with long follow-up periods. Two major challenges often arise simultaneously in these settings:

Time-Varying Confounding: Variables that are affected by prior exposure and subsequently affect future exposure and the outcome (e.g., health status affecting future smoking habits and future disease risk). Standard regression adjustment is biased here: adjusting for such a variable blocks part of the effect of earlier exposure, while leaving it out leaves later exposure confounded.
Semi-Competing Events: A scenario where a terminal event (e.g., death) precludes a non-terminal, recurrent event (e.g., hypertension), but the non-terminal event does not preclude the terminal event.
- The Conflict: Existing methods typically handle one issue but not the other. Standard g-computation handles time-varying confounding but often treats terminal events as simple censoring (which induces bias by implicitly assuming the intervention prevents death without altering the disease hazard). Conversely, methods for semi-competing events often assume baseline (fixed) actions only, failing to account for dynamic interventions.
- The Consequence: Simply censoring individuals who die leads to non-ignorable bias and misinterpretation of public health interventions, particularly in aging research where mortality rates increase over time.

2. Methodology

The authors propose two novel g-computation algorithms (Standard and Iterated Conditional Expectation [ICE]) designed to estimate causal effects in the presence of both time-varying actions and semi-competing events.

Data Structure and Definitions

States: The outcome $Y_{i,k}$ is modeled as a multistate variable with three states:
1. State 1: Alive, no intermediate event (e.g., no hypertension).
2. State 2: Alive, with intermediate event (e.g., hypertension).
3. State 3: Dead (Terminal/Absorbing state).
Action Plan: A deterministic sequence of actions $a^*$ over time (e.g., "prevent smoking" at all time points).
Estimand: The authors define a vector of causal effects $\psi(\tau)$ representing the difference in proportions between two action plans for:
- The intermediate event (Prevalence Difference).
- The terminal event (Risk Difference).
- Note: The "no event" state is inferred by the constraint that probabilities sum to 1.
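
As an illustration, the estimand described above can be written out as follows. This is a sketch of one plausible notation, not necessarily the paper's: the potential-outcome notation $Y_{i,\tau}(a)$ and the symbol $a'$ for the comparison action plan are assumptions introduced here.

$$
\psi(\tau) =
\begin{pmatrix}
\Pr\{Y_{i,\tau}(a^*) = 2\} - \Pr\{Y_{i,\tau}(a') = 2\} \\
\Pr\{Y_{i,\tau}(a^*) = 3\} - \Pr\{Y_{i,\tau}(a') = 3\}
\end{pmatrix}
$$

The first row is the prevalence difference for the intermediate event and the second is the risk difference for the terminal event. Under each plan $a$, the remaining probability follows from the sum-to-one constraint: $\Pr\{Y_{i,\tau}(a) = 1\} = 1 - \Pr\{Y_{i,\tau}(a) = 2\} - \Pr\{Y_{i,\tau}(a) = 3\}$.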

Identification Assumptions

The method relies on standard causal inference assumptions adapted for this setting:

Causal Consistency: Observed outcomes match potential outcomes under the observed treatment.
Time-Varying Action Exchangeability: At each time point, the action received is independent of the potential outcomes given the history of covariates and actions.
Positivity: There is a non-zero probability of following any action plan given the history.
Time-Varying Censoring Exchangeability: Censoring is independent of potential outcomes given history.

The Algorithms

Both algorithms utilize multinomial logistic regression to model transitions between the three states.

Standard G-Computation:
- Fits regression models for the outcome and time-varying covariates at each time step.
- Simulates a large population (Monte Carlo) by sampling baseline covariates and iteratively predicting future states ($Y$) and covariates ($L$) under a specific action plan.
- If a simulated individual enters the terminal state (Death), they stay there for all remaining time steps (absorbing state) and are still counted in the final proportions.
- Calculates the final proportion of individuals in each state.
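
The simulation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the coefficients `BETA_Y` and `BETA_L` are made-up stand-ins for fitted multinomial and logistic models, the single binary covariate and its baseline draw are assumptions (the real algorithm samples baseline covariates from the observed data), and the ordering of covariate and outcome updates within a time step is also assumed.

```python
import math
import random

# States: 1 = alive without event, 2 = alive with event, 3 = dead (absorbing).
ALIVE_NO_EVENT, ALIVE_EVENT, DEAD = 1, 2, 3

# Hypothetical coefficients standing in for a fitted multinomial logistic model.
# Order: intercept, action, covariate L, indicator of previous state 2.
# State 1 is the reference category (linear predictor 0).
BETA_Y = {ALIVE_EVENT: (-2.0, 0.4, 0.5, 2.5), DEAD: (-4.0, 0.6, 0.3, 0.5)}
# Hypothetical logistic model for L: intercept, previous action, previous L.
BETA_L = (-1.0, 0.8, 1.2)


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def state_probs(action, cov, prev_state):
    """Multinomial (softmax) probabilities of states 1, 2, 3 given history."""
    x = (1.0, action, cov, 1.0 if prev_state == ALIVE_EVENT else 0.0)
    scores = {ALIVE_NO_EVENT: 0.0}
    for state, beta in BETA_Y.items():
        scores[state] = sum(b * v for b, v in zip(beta, x))
    denom = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / denom for s, v in scores.items()}


def simulate(plan, n=20000, seed=0):
    """Monte Carlo g-computation: proportion in each state after the plan."""
    rng = random.Random(seed)
    counts = {ALIVE_NO_EVENT: 0, ALIVE_EVENT: 0, DEAD: 0}
    for _ in range(n):
        cov = int(rng.random() < 0.3)  # assumed baseline covariate draw
        state, prev_action = ALIVE_NO_EVENT, 0
        for action in plan:
            if state == DEAD:
                break  # absorbing: no more transitions, but still counted below
            p_l = expit(BETA_L[0] + BETA_L[1] * prev_action + BETA_L[2] * cov)
            cov = int(rng.random() < p_l)
            p = state_probs(action, cov, state)
            u = rng.random()
            if u < p[ALIVE_NO_EVENT]:
                state = ALIVE_NO_EVENT
            elif u < p[ALIVE_NO_EVENT] + p[ALIVE_EVENT]:
                state = ALIVE_EVENT
            else:
                state = DEAD
            prev_action = action
        counts[state] += 1
    return {s: c / n for s, c in counts.items()}


always = simulate([1, 1, 1])  # action taken at every time point
never = simulate([0, 0, 0])   # action prevented at every time point
psi = (never[ALIVE_EVENT] - always[ALIVE_EVENT], never[DEAD] - always[DEAD])
```

Here `psi` holds the prevalence difference and the risk difference for the toy model. Because simulated deaths stay in the denominator, the three proportions returned by `simulate` always sum to 1.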
ICE G-Computation (Iterated Conditional Expectation):
- A computationally efficient alternative that avoids full data simulation.
- Fits a multinomial model for the outcome at the final time point conditional on the full history.
- Works backwards in time (from $\tau$ down to 1), iteratively predicting the probability of being in a specific state at time $k$ based on the predicted probabilities at time $k+1$.
- This approach effectively integrates over the distribution of time-varying confounders without generating synthetic datasets.
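
The backward recursion can be illustrated with a small Python sketch for two time points. To stay dependency-free it swaps the multinomial logistic regressions for stratum-specific means (a saturated model), which is only workable because every variable in this toy example is binary or categorical. The data-generating function `make_toy_data` and all of its probabilities are hypothetical, invented purely for illustration.

```python
import random
from collections import defaultdict

STATES = (1, 2, 3)  # 1 = alive, no event; 2 = alive with event; 3 = dead


def draw_state(rng, p_event, p_dead):
    u = rng.random()
    return 3 if u < p_dead else 2 if u < p_dead + p_event else 1


def make_toy_data(n=5000, seed=1):
    """Hypothetical two-wave cohort; rows are (L1, A1, Y1, L2, A2, Y2)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l1 = int(rng.random() < 0.4)
        a1 = int(rng.random() < 0.2 + 0.4 * l1)
        y1 = draw_state(rng, 0.10 + 0.05 * a1 + 0.05 * l1, 0.02 + 0.03 * a1)
        if y1 == 3:  # dead at wave 1: nothing further is observed
            rows.append((l1, a1, y1, None, None, 3))
            continue
        l2 = int(rng.random() < 0.2 + 0.3 * a1 + 0.3 * l1)
        a2 = int(rng.random() < 0.1 + 0.5 * a1 + 0.2 * l2)
        p_event = 0.8 if y1 == 2 else 0.10 + 0.05 * a2 + 0.05 * l2
        y2 = draw_state(rng, p_event, 0.02 + 0.03 * a2 + 0.02 * l2)
        rows.append((l1, a1, y1, l2, a2, y2))
    return rows


def mean_vectors(pairs):
    """Average 3-vectors within each key (a saturated 'regression')."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for key, vec in pairs:
        acc = sums[key]
        for j in range(3):
            acc[j] += vec[j]
        acc[3] += 1
    return {k: tuple(v[j] / v[3] for j in range(3)) for k, v in sums.items()}


def one_hot(y):
    return tuple(float(y == s) for s in STATES)


def ice_gcomp(rows, plan):
    """State probabilities at wave 2 under plan = (a1*, a2*)."""
    a1_star, a2_star = plan
    # Final time point: model Y2 given full history, using A2 = a2*.
    q2 = mean_vectors(((l1, a1, y1, l2), one_hot(y2))
                      for l1, a1, y1, l2, a2, y2 in rows
                      if y1 != 3 and a2 == a2_star)

    def q2_hat(l1, a1, y1, l2):
        # Anyone already dead stays dead with probability 1.
        return (0.0, 0.0, 1.0) if y1 == 3 else q2[(l1, a1, y1, l2)]

    # Step back: regress the predictions on earlier history, using A1 = a1*.
    q1 = mean_vectors((l1, q2_hat(l1, a1, y1, l2))
                      for l1, a1, y1, l2, a2, y2 in rows if a1 == a1_star)
    # Average over the observed baseline covariate distribution.
    return tuple(sum(q1[r[0]][j] for r in rows) / len(rows) for j in range(3))


data = make_toy_data()
never = ice_gcomp(data, (0, 0))
always = ice_gcomp(data, (1, 1))
psi = (never[1] - always[1], never[2] - always[2])  # event, death differences
```

No synthetic population is generated: each step only averages predictions over the observed data. With continuous covariates or many time points the stratum means would have to be replaced by parametric models, as in the multinomial regressions described above.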

Variance Estimation

Standard: Non-parametric bootstrap.
ICE: Can use the empirical sandwich variance estimator to avoid the computational burden of bootstrapping.
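
For readers unfamiliar with the bootstrap option, the idea can be sketched generically: resample individuals with replacement, re-run the estimator on each resample, and take the standard deviation of the results. The function names and the toy estimator below are hypothetical. In a real analysis `estimator` would refit all models and rerun g-computation on each resample, which is where the computational burden comes from.

```python
import random
import statistics


def bootstrap_se(rows, estimator, n_boot=200, seed=0):
    """Non-parametric bootstrap standard error of estimator(rows)."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_boot):
        # Resample individuals (not single observations) with replacement.
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


def proportion_dead(states):
    """Toy estimator: share of people whose final state is 3 (dead)."""
    return sum(s == 3 for s in states) / len(states)


final_states = [3] * 50 + [1] * 950  # made-up final states for 1000 people
estimate = proportion_dead(final_states)
se = bootstrap_se(final_states, proportion_dead)
ci = (estimate - 1.96 * se, estimate + 1.96 * se)  # Wald-type 95% interval
```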

3. Key Contributions

Methodological Extension: The first extension of g-computation to simultaneously handle time-varying actions and semi-competing events (recurrent intermediate events + terminal events).
Dual Estimators: Provides both a standard simulation-based approach and an ICE approach, offering flexibility for different computational constraints.
Open-Source Implementation: The authors provide R and Python code to facilitate adoption by the research community.
Theoretical Rigor: Formalizes the identification assumptions and the estimand vector for multistate outcomes in this specific context.

4. Results

Simulation Study

Setup: Monte Carlo simulations ($N = 500, 2000$) comparing the proposed estimators against two alternatives:
1. Alt 1: G-computation for semi-competing events but only with baseline actions (ignores time-varying confounding).
2. Alt 2: ICE g-computation with time-varying actions but treating death as censoring (ignores semi-competing nature).
Findings:
- Bias: The proposed estimators were approximately unbiased for both the intermediate and terminal events.
- Coverage: Both proposed methods achieved ~95% confidence interval coverage.
- Performance: The alternative estimators showed substantial bias and poor coverage (e.g., Alt 1 had 52% coverage for the intermediate event; Alt 2 could not estimate the terminal event risk).
- Precision: The proposed estimators had the lowest Root Mean Squared Error (RMSE).
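
The performance measures listed above are typically computed from the estimates across simulation replicates. The sketch below shows the usual definitions; the function name and the four example replicates are hypothetical and are not taken from the paper.

```python
import math
import statistics


def performance(estimates, std_errors, truth, z=1.96):
    """Bias, RMSE and confidence interval coverage across replicates."""
    bias = statistics.fmean(estimates) - truth
    rmse = math.sqrt(statistics.fmean([(e - truth) ** 2 for e in estimates]))
    # A replicate "covers" if its Wald interval contains the true value.
    covered = [abs(e - truth) <= z * se for e, se in zip(estimates, std_errors)]
    return {"bias": bias, "rmse": rmse, "coverage": sum(covered) / len(covered)}


# Four made-up replicates with a known true value of 1.0.
result = performance([0.9, 1.0, 1.1, 1.2], [0.1, 0.1, 0.1, 0.1], truth=1.0)
```

In this toy input the last replicate's interval misses the truth, so three of the four intervals cover it.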

Applied Example (Add Health Cohort)

Context: Investigated the effect of preventing cigarette smoking (Waves III-V) on prevalent hypertension and death (Wave VI) in a cohort of young adults aging into midlife ($N \approx 13{,}909$).
Findings:
- Hypertension: Preventing smoking would reduce the prevalence of hypertension by 1.1 percentage points (18.4% vs. 19.5%).
- Death: Preventing smoking would reduce the risk of death by 1.6 percentage points (3.9% vs. 5.5%).
- Comparison: The novel estimator provided more precise estimates (narrower confidence intervals) than the alternative methods. Alt 1 underestimated the death risk reduction, while Alt 2 assumed zero risk of death under the intervention.
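
To make the scale of these effects concrete, the reported figures can be read both as absolute differences (percentage points) and as relative reductions. The arithmetic below uses only the numbers quoted above and assumes the second figure in each pair is the estimate under the comparison plan.

$$
\begin{aligned}
\text{Hypertension:}\quad & 19.5\% - 18.4\% = 1.1 \text{ percentage points}, & \frac{1.1}{19.5} &\approx 5.6\% \text{ relative reduction} \\
\text{Death:}\quad & 5.5\% - 3.9\% = 1.6 \text{ percentage points}, & \frac{1.6}{5.5} &\approx 29\% \text{ relative reduction}
\end{aligned}
$$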

5. Significance and Implications

Aging Research: As longitudinal cohorts (like Add Health) age, the rate of death increases. This method provides a necessary tool to analyze chronic disease development without the bias introduced by ignoring death as a competing risk.
Public Health Decision Making: By correctly modeling the interplay between death and intermediate diseases, policymakers can better understand the true impact of interventions (e.g., smoking cessation) on both morbidity and mortality.
Beyond Chronic Disease: The framework is applicable to other settings involving intercurrent events in clinical trials (e.g., treatment discontinuation or death) where per-protocol effects need to be estimated in the presence of informative censoring.
Future Directions: The authors note the potential to extend this to multiple intermediate states, stochastic action plans, and inverse probability weighting (IPW) combinations for doubly robust estimation.

In summary, this paper fills a critical methodological void, offering a robust, validated, and implementable solution for causal inference in complex longitudinal studies where time-varying interventions interact with the risk of death.
