Causal analyses using education-health linked data for England: a case study

This paper summarizes lessons from the HOPE study, which used the target trial emulation framework and simulated data to guide causal analyses of the effects of special educational needs provision on health and education outcomes in England's linked administrative data. It ultimately recommends specifying causal questions carefully and comparing alternative estimation methods to check that results are robust.

De Stavola, B. L., Aparicio Castro, A., Nguyen, V. G., Lewis, K. M., Dearden, L., Harron, K., Zylbersztejn, A., Shumway, J., Gilbert, R.

Published 2026-03-19

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: Trying to Figure Out "What Works"

Imagine you are a school principal trying to decide if a new, expensive tutoring program actually helps students stay in class and stop skipping school. You have a massive pile of data on every student in the country—grades, attendance records, health issues, and family backgrounds.

The problem? You can't just look at the data and say, "Oh, students who got tutoring had better attendance, so tutoring works!" Why? Because maybe the students who got tutoring were already the ones with the most support at home, or maybe they were the ones struggling the most to begin with. It's like judging whether umbrellas work by checking who got wet: the people carrying umbrellas may have been exactly the ones caught out in the heaviest rain.

This paper is a "how-to" guide for researchers who want to use these massive piles of data to answer real questions about what works, without running a fake experiment.

The Story: The HOPE Study

The authors are part of a team called HOPE (Health Outcomes for young People throughout Education). They wanted to know: Does provision for special educational needs and disabilities (SEND) actually help kids stay in school and reduce their "unauthorized absences" (skipping class)?

They tried to answer this using a "Causal Roadmap," which is like a GPS for finding the truth in messy data. Here is how they navigated the journey:

1. Sharpening the Question (The "Target Trial" Analogy)

At first, their question was too broad, like asking, "Does exercise make you healthy?"

  • The Fix: They used a framework called Target Trial Emulation. Imagine they are trying to build a perfect, imaginary science experiment (a "Target Trial") where they could magically assign kids to get special help or not, and then watch what happens.
  • The Reality Check: Since they couldn't do magic, they had to look at their real-world data and ask: "Can we pretend our data looks like that perfect experiment?"
  • The Result: They realized they couldn't study every kid. They had to narrow it down to specific groups (like kids with cleft lip or cerebral palsy) where the data was clear enough to make a fair comparison. They also had to define exactly when the help started and when they measured the skipping. (A toy "protocol" for such a trial is sketched after this list.)
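To make the idea concrete, here is a minimal sketch of what a target trial "protocol" for this question might look like, written as plain Python data. The component names follow the standard target trial emulation checklist (eligibility, strategies, assignment, follow-up, outcome, contrast, analysis plan); the specific entries are illustrative assumptions, not the HOPE study's actual protocol.

```python
# Illustrative target-trial protocol, written as plain data. The component
# names are the standard checklist; the entries are assumptions for
# illustration only.
target_trial_protocol = {
    "eligibility": "Pupils in a specific clinical group (e.g. cerebral palsy) "
                   "enrolled in state schools in England at the start of a school year",
    "treatment_strategies": ["Receive SEND provision from that year onward",
                             "Do not receive SEND provision"],
    "assignment": "Randomized in the ideal trial; emulated by adjusting for "
                  "measured confounders (prior health, attainment, family background)",
    "time_zero": "Start of the school year in which eligibility is first met",
    "follow_up": "Until the end of the observed school records, transfer out, or death",
    "outcome": "Unauthorized absences during follow-up",
    "causal_contrast": "Difference in mean absence risk under the two strategies",
    "analysis_plan": "Compare G-computation, IPW, and AIPW for robustness",
}
```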

2. The "Simulation" Playground

Before they touched the real, messy data, they built a video game version of the problem.

  • The Analogy: Think of this like a flight simulator for pilots. Before flying a real plane with passengers, they built a computer simulation where they knew exactly what would happen if they turned the wheel left or right.
  • Why? They created 10,000 fake students with known "true" outcomes, so they knew exactly how much the special help should reduce skipping. (A toy version is sketched just after this list.)
  • The Lesson: They tested different math tools on this fake data. Some tools gave the right answer, but only if you set them up perfectly. Others gave wrong answers if you made a tiny mistake in the settings. This taught them which tools were the most reliable "flight instruments."
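Below is a minimal sketch of that idea, assuming a single baseline confounder and a made-up "true" effect. It is illustrative only, not the HOPE study's actual simulation; all the numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # the paper's example uses 10,000 simulated students

# One baseline confounder: underlying need for support. It drives both who
# gets support and who skips school.
need = rng.normal(size=n)

# Treatment: needier students are more likely to receive support.
support = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.0 * need))))

# Outcome: unauthorized absence. The built-in "truth" is that support lowers
# the log-odds of absence by 0.7.
TRUE_EFFECT = -0.7
absence = rng.binomial(
    1, 1 / (1 + np.exp(-(-1.0 + 0.8 * need + TRUE_EFFECT * support))))

# Because we wrote the simulation, we can compute the true average effect on
# the risk scale that a good estimator should recover...
p1 = 1 / (1 + np.exp(-(-1.0 + 0.8 * need + TRUE_EFFECT)))
p0 = 1 / (1 + np.exp(-(-1.0 + 0.8 * need)))
print(f"true risk difference:  {(p1 - p0).mean():+.3f}")

# ...and compare it with the naive comparison, which is biased because
# needier students both get more support and skip more school.
naive = absence[support == 1].mean() - absence[support == 0].mean()
print(f"naive risk difference: {naive:+.3f}")
```

Running this shows the gap the paper warns about: the naive difference can understate, or even reverse, the true benefit, because the students who received support started out needier.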

3. The Tools of the Trade (The Math)

To find the truth, they used three main "tools" (statistical methods). The paper explains how these tools behave; a toy implementation of all three follows this list:

  • G-computation: Like a chef following a recipe. If you miss one ingredient (a specific detail in the data), the cake (the result) tastes bad. It needs a very precise recipe.
  • IPW (Inverse Probability Weighting): Like a judge weighing evidence. It gives more weight to the students who are "rare" in the data to make the groups look fair. It's good at spotting if the groups are too different to compare, but it can be a bit shaky (imprecise).
  • AIPW (The Hybrid): This is the "best of both worlds" tool. It's like having a backup generator. If one part of the math fails, the other part can still save the day. The paper found this to be the most robust tool.
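Here is a hedged, minimal sketch of the three estimators, run on toy data like the simulation above. It is illustrative, not the paper's code: it uses default scikit-learn models, and the data-generating numbers are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Regenerate the toy data from the simulation sketch above.
rng = np.random.default_rng(42)
n = 10_000
need = rng.normal(size=n)
support = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.0 * need))))
absence = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * need - 0.7 * support))))
X = need.reshape(-1, 1)

# --- G-computation ("the recipe"): model the outcome given need and support,
# then predict everyone's risk under "all supported" vs "none supported".
out = LogisticRegression().fit(np.column_stack([need, support]), absence)
risk1 = out.predict_proba(np.column_stack([need, np.ones(n)]))[:, 1]
risk0 = out.predict_proba(np.column_stack([need, np.zeros(n)]))[:, 1]
gcomp = (risk1 - risk0).mean()

# --- IPW ("the judge"): model who gets support, then up-weight students who
# received a treatment that was unusual for someone like them.
ps = LogisticRegression().fit(X, support).predict_proba(X)[:, 1]
ipw = (np.average(absence, weights=support / ps)
       - np.average(absence, weights=(1 - support) / (1 - ps)))

# --- AIPW ("the hybrid"): combines both models; it stays consistent if
# EITHER the outcome model or the treatment model is correct.
mu1 = risk1 + support * (absence - risk1) / ps
mu0 = risk0 + (1 - support) * (absence - risk0) / (1 - ps)
aipw = (mu1 - mu0).mean()

print(f"G-computation: {gcomp:+.3f}  IPW: {ipw:+.3f}  AIPW: {aipw:+.3f}")
```

Because both working models happen to be correctly specified here, all three estimates should land near the true risk difference from the simulation sketch. The interesting exercise, the kind of stress test the paper describes, is to mis-specify one model at a time and watch which estimators break.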

4. The "Time-Travel" Problem

One of the trickiest parts was sustained support: help that continues year after year rather than being a one-off.

  • The Analogy: Imagine a student gets help in Year 1. That help changes their health, which changes their behavior in Year 2, which changes whether they get help in Year 3.
  • The Trap: If you just use a standard math formula (like a simple regression) and adjust for everything at once, you accidentally "block" the path through which the help works: adjusting for the Year 2 behavior strips out the part of the Year 1 help's effect that flows through it. It's like trying to measure a car's speed by putting a wall across the road; you stop the car, so there is nothing left to measure.
  • The Solution: Their advanced tools (G-computation and IPW, in their longitudinal forms) were able to untangle this knot, showing that long-term, sustained help has a bigger impact than a one-time fix. (A toy sketch of the sequential form of G-computation follows.)
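Below is a minimal sketch of g-computation for a sustained ("supported in both years") strategy over two school years, using the sequential-regression (iterated conditional expectation) form. Everything here is an illustrative assumption: the variable names, the data-generating numbers, and the use of simple linear models for brevity. It is not the HOPE study's implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 10_000

L1 = rng.normal(size=n)                                   # need at start of Year 1
A1 = rng.binomial(1, 1 / (1 + np.exp(-L1)))               # support in Year 1
L2 = 0.6 * L1 - 0.4 * A1 + rng.normal(size=n)             # Year 2 need: shaped by A1!
A2 = rng.binomial(1, 1 / (1 + np.exp(-(L2 + 0.5 * A1))))  # support in Year 2
# Absence risk rises with Year 2 need and falls with support in either year.
Y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * L2 - 0.5 * A1 - 0.5 * A2))))

def g_comp(a1, a2):
    """Mean absence risk if everyone followed the strategy (A1=a1, A2=a2)."""
    # Step 1: regress Y on the full history, then predict with both years of
    # support set by the strategy (keeping each student's observed L1, L2).
    q2 = LinearRegression().fit(np.column_stack([L1, A1, L2, A2]), Y).predict(
        np.column_stack([L1, np.full(n, a1), L2, np.full(n, a2)]))
    # Step 2: regress that pseudo-outcome on Year 1 history only, then predict
    # with Year 1 support set by the strategy. Averaging the result integrates
    # over how Year 2 need would have evolved under the strategy.
    q1 = LinearRegression().fit(np.column_stack([L1, A1]), q2).predict(
        np.column_stack([L1, np.full(n, a1)]))
    return q1.mean()

# Contrast "supported in both years" with "never supported". A single
# regression that simply adjusts for L2 would block part of A1's effect,
# because Year 2 need sits on the pathway from Year 1 support to the outcome.
print(f"sustained vs never: {g_comp(1, 1) - g_comp(0, 0):+.3f}")
```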

The Takeaway: What Should We Learn?

The paper concludes with three main lessons for anyone trying to use big data to make policy decisions:

  1. Be Specific: Don't ask "Does X work?" Ask "Does X work for this specific group, starting at this time, measured this way?"
  2. Practice on Fake Data: Before you trust your results on real people, build a simulation where you know the answer. If your math tools can't solve the fake puzzle, they definitely can't solve the real one.
  3. Check Your Tools: Don't just use one math method. Try a few different ones. If they all point in the same direction, you can be more confident. If they disagree, you need to dig deeper.

In short: This paper is a guidebook for researchers, warning them that big data is powerful but tricky. You have to be a careful detective, use the right tools, and double-check your work with simulations before you tell the world what works.
