Efficient estimation of cumulative incidence curves via data fusion with surrogates: application to integrated analysis of vaccine trial and immunobridging data

This paper develops efficient and multiply robust statistical methods for estimating cumulative incidence curves by fusing participant-level data from historical vaccine efficacy trials and immunobridging studies, with applications demonstrated on multi-serotype pathogens and COVID-19 booster data.

Pan Zhao, Peter B. Gilbert, Oliver Dukes, Bo Zhang

Published 2026-04-16

Imagine you are a chef who invented a delicious new soup (a vaccine) that prevented a specific type of cold (a virus). You tested it on thousands of people in a big kitchen (a Phase 3 clinical trial) and proved it works. You know exactly how the soup tastes and how it helps people.

Now, a new strain of cold virus is spreading. You want to update your soup recipe to fight this new strain. But you can't just run a massive, expensive, years-long trial on thousands of people again. That would take too long and cost too much money.

Instead, you want to use a shortcut: Immunobridging.

This paper is about a sophisticated statistical "recipe" that allows scientists to predict how well a new, updated soup will work, by mixing two different bowls of data together.

The Two Bowls of Data

  1. The Big Bowl (The Historical Trial): This contains data from the original, massive trial. It has thousands of people. We know what they ate (the vaccine), how their immune system reacted (the "immune marker," like an antibody level), and whether they got sick later.
  2. The Small Bowl (The Immunobridging Study): This is a smaller, newer study with fewer people. They took the new soup recipe. We know how their immune system reacted to the new soup, but we don't yet know whether they got sick, because the study is too short or too small to observe that.

The Goal: We want to predict the "sickness curve" (how many people get sick over time) for the people in the Small Bowl, using the Big Bowl as a guide.

The Problem: It's Not Just About the Soup

If the only thing that mattered was the soup, we could just say, "If the immune reaction is the same, the protection is the same." But life is messy.

  • Different People: The people in the Big Bowl might be older or have different health histories than the people in the Small Bowl.
  • Different Viruses: The virus in the Big Bowl might be slightly different from the one in the Small Bowl.
  • The "Hidden" Factor: The immune marker (like an antibody level) is a surrogate. It's a stand-in for the real thing. It's like looking at a car's speedometer to guess how far you've traveled. Usually, it works. But sometimes, the speedometer is right, but the engine is sputtering, or the road is icy.

The authors realized that simply comparing the speedometers (antibody levels) isn't enough. You have to account for the driver (the person's baseline health) and the road conditions (the virus strain).

The Solution: The "Time-Travel" Calculator

The authors developed a mathematical method that acts like a time-travel calculator. It asks a "What if?" question:

"If the people in the Small Bowl (who got the new soup) had been in the Big Bowl (where we know who got sick), how would their sickness curve look?"

To do this, they use a technique called Data Fusion. They stitch the two bowls together using three clever rules (assumptions):

  1. The "Same Driver" Rule: If two people have the same health history and the same antibody level, they should have the same risk of getting sick, regardless of which study they were in.
  2. The "No Magic" Rule: The new soup doesn't have any secret superpowers that the old soup doesn't have, other than what is shown by the antibody level. If the antibody levels are the same, the protection should be the same. (If the new soup had a secret "super-ingredient" that the antibody test couldn't see, this rule would break).
  3. The "Bridge" Rule: We can mathematically translate the risk from the Big Bowl to the Small Bowl by adjusting for the differences in the people and the virus.

The Result: A Crystal Ball for Vaccines

By using this method, the authors can draw a Cumulative Incidence Curve. Think of this as a weather forecast for getting sick.

  • Without this method: We would have to wait years to see if the new vaccine works.
  • With this method: We can look at the antibody levels from the small study, mix them with the "sickness history" from the big study, and instantly generate a prediction: "Based on this data, about 5% of people might get sick in 3 months, and 7% in 6 months."

Real-World Example: The COVAIL Trial

The paper tested this on real data from the COVAIL trial (a study on COVID-19 boosters).

  • They had data from an old trial with the original virus.
  • They had a small new study with a "bivalent" booster (designed for new variants).
  • They used their method to predict how well the new booster would work against the new variant.

They even used the method to check their own rules. They asked: "Does the new vaccine have any secret superpowers the antibody test missed?" By comparing their prediction to the actual results (once those became available), they found that the antibody test did miss some protection. This showed that the method is smart enough to tell us when our assumptions are wrong.

Why This Matters

This paper is like giving regulators (the FDA) and scientists a super-powered telescope.

  • Speed: We don't have to wait years to approve new vaccines for new variants.
  • Public health: We can approve vaccines faster, saving lives during outbreaks.
  • Efficiency: We don't need to run massive, expensive trials for every single update.

In short, the authors built a bridge between the past (what we know worked) and the future (what we need to know will work), allowing us to cross over with confidence, even when we don't have all the data yet.
