Estimation of heterogeneous principal effects under principal ignorability

This paper proposes a framework for heterogeneous principal causal effects and develops several estimators, with varying degrees of robustness, for estimating and conducting inference on them under principal ignorability. The estimators' theoretical properties and practical use are demonstrated through the Camden Coalition hotspotting randomized trial.

Rui Zhang, Charles R. Doss, Jared D. Huling

Published Wed, 11 Ma

Imagine you are a doctor trying to figure out if a new medicine works. You run a big test: half the patients get the medicine, half get a sugar pill. At the end, you look at the average result. "Hmm," you say, "The medicine didn't help anyone overall."

But then, a detective in your team whispers, "Wait a minute. What if the medicine did work, but only for the people who actually took it every day? What if the people who skipped doses or stopped taking it were the ones dragging down the average?"

This is the core problem this paper tackles. It's about finding the truth hidden inside the "average."

The Problem: The "Ghost" Groups

In medical trials (and many other experiments), people fall into invisible groups based on how they would react to the treatment, not just how they actually did.

  • The Compliers: People who take the medicine if assigned to it, and don't if they aren't.
  • The Never-Takers: People who refuse the medicine no matter what.
  • The Always-Takers: People who take the medicine even if they were assigned the sugar pill.

The tricky part? You can't see these groups directly. You only see who actually took the pill. If you try to compare "people who took the pill" vs. "people who didn't," you get a biased result because the "pill-takers" might be healthier or more motivated to begin with.
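A tiny simulation (all numbers invented for illustration) makes the bias concrete: suppose healthier people are more likely to comply, and the pill truly does nothing. The as-treated comparison then "finds" an effect anyway, while the randomized comparison correctly finds none.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden health level: healthier people are more likely to comply.
health = rng.normal(size=n)
complier = rng.random(n) < 1 / (1 + np.exp(-health))

# Randomized assignment; only assigned compliers actually take the pill.
assigned = rng.random(n) < 0.5
took = assigned & complier

# The pill truly has ZERO effect; the outcome depends only on health.
outcome = health + rng.normal(scale=0.5, size=n)

naive = outcome[took].mean() - outcome[~took].mean()        # as-treated comparison
itt = outcome[assigned].mean() - outcome[~assigned].mean()  # randomized comparison

print(f"naive as-treated difference: {naive:.3f}")  # clearly nonzero: biased
print(f"intent-to-treat difference:  {itt:.3f}")    # near zero: unbiased
```

The pill-takers are disproportionately the healthy compliers, so comparing takers to non-takers confuses the pill with the health that drove compliance.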

The Goal: Finding the "Heterogeneous" Truth

The authors want to know: Does the medicine work differently for different types of "Compliers"?
Maybe it works wonders for young women with high blood pressure but does nothing for older men with diabetes. If you only look at the average effect across all compliers, you miss these crucial details. This variation is called treatment effect Heterogeneity.
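A quick synthetic example (numbers invented for illustration) of how an average can hide this kind of heterogeneity: a treatment that helps one subgroup by +2 and another not at all shows up as a bland "+1 on average."

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40_000

# Two observable subgroups; the treatment helps only subgroup A.
group_a = rng.random(n) < 0.5
treated = rng.random(n) < 0.5
effect = np.where(group_a, 2.0, 0.0)  # true effect: +2 in A, 0 in B

outcome = treated * effect + rng.normal(size=n)

overall = outcome[treated].mean() - outcome[~treated].mean()
in_a = outcome[treated & group_a].mean() - outcome[~treated & group_a].mean()
in_b = outcome[treated & ~group_a].mean() - outcome[~treated & ~group_a].mean()

print(f"overall effect: {overall:.2f}")  # about +1: masks the split
print(f"effect in A:    {in_a:.2f}")     # about +2
print(f"effect in B:    {in_b:.2f}")     # about 0
```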

The Solution: A New Toolkit for "Principal Ignorability"

The paper proposes a new set of mathematical tools (estimators) to uncover these hidden, specific effects. They rely on an assumption called Principal Ignorability: roughly, once you account for a person's observed characteristics (age, sex, medical history, and so on), knowing which hidden group they belong to tells you nothing extra about how they would respond.

The Analogy: The Detective's Notebook
Imagine you are a detective trying to solve a crime. You know the suspect (the treatment) and the victim (the outcome), but you don't know exactly who was in the room (the principal stratum).

  • Old Method (The T-Learner): You try to guess the suspect's motive by looking at two separate lists: "People who were in the room" and "People who weren't." If your lists are messy or incomplete, your guess is wrong.
  • The New Method (The Paper's Approach): The authors built three new "detective kits" that are much smarter and more forgiving of messy data.
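The "old method" here is essentially a T-learner: fit one outcome model per group and subtract the predictions. A minimal least-squares sketch on synthetic data, shown in the ordinary treated-vs-control setting rather than the paper's principal-stratum setting:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
treated = rng.random(n) < 0.5
# True effect varies with x: 1 + x.
y = 0.5 * x + treated * (1.0 + x) + rng.normal(size=n)

def fit_linear(x, y):
    """Least-squares fit of y on [1, x]; returns intercept and slope."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# T-learner: a separate outcome model for each group.
b1 = fit_linear(x[treated], y[treated])
b0 = fit_linear(x[~treated], y[~treated])

def cate(x_new):
    """Estimated effect at covariate value x_new: difference of predictions."""
    return (b1[0] + b1[1] * x_new) - (b0[0] + b0[1] * x_new)

print(f"estimated effect at x=0: {cate(0.0):.2f}")  # truth: 1
print(f"estimated effect at x=1: {cate(1.0):.2f}")  # truth: 2
```

Here both "lists" are clean, so the T-learner works; its weakness is that the answer hinges entirely on both outcome models being right.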

1. The "Subset" Kit (The Double-Check)

This method looks at a specific slice of the data (e.g., only people who actually took the pill).

  • The Superpower: It's Doubly Robust. Imagine you are trying to guess a person's height. You can use a ruler (Model A) or a shadow measurement (Model B). If either your ruler is perfect OR your shadow math is perfect, you get the right answer. You don't need both to be perfect. This makes it very reliable.
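The paper's subset estimator is specific to principal strata, but the doubly robust idea itself can be sketched with a generic AIPW-style estimator: deliberately use a useless outcome model, keep the true treatment probabilities, and the answer still comes out right (synthetic data, illustrative only, not the paper's exact estimator):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-x))  # treatment probability depends on x
treated = rng.random(n) < p_true
y = 2.0 * x + 2.0 * treated + rng.normal(size=n)  # true effect: +2, confounded by x

naive = y[treated].mean() - y[~treated].mean()  # biased by confounding

# Doubly robust (AIPW) with a deliberately WRONG outcome model (predict 0)
# but the TRUE propensity score: it still recovers the effect.
mu1 = np.zeros(n)
mu0 = np.zeros(n)
aipw = np.mean(
    mu1 - mu0
    + treated * (y - mu1) / p_true
    - (~treated) * (y - mu0) / (1 - p_true)
)
print(f"naive difference: {naive:.2f}")  # well above 2: biased
print(f"doubly robust:    {aipw:.2f}")   # near 2
```

The symmetric case also holds: a correct outcome model rescues a wrong propensity model. Either "ruler" alone is enough.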

2. The "EIF" Kit (The All-Seeing Eye)

This method tries to use every single piece of data at once using a complex formula called the "Efficient Influence Function."

  • The Superpower: It's Multiply Robust. It has three paths to the truth. If your "ruler" is bad, but your "shadow math" and a third "wind measurement" are good, you still get the right answer.
  • The Catch: It's like a high-performance race car. It's theoretically the fastest, but if the road is bumpy (small data or messy numbers), it can crash. It's very sensitive to small errors.
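That sensitivity is largely the familiar fragility of inverse-probability weighting: when one estimated probability drifts toward zero, a single observation can swing the whole estimate. A toy illustration (not the paper's estimator; the numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000
y = rng.normal(loc=1.0, size=n)
y[0] = 2.0  # one ordinary, unremarkable outcome

# Inverse-probability weighting: each unit contributes y / p_hat.
p_hat = np.full(n, 1.0)          # well-behaved estimated probabilities
est_clean = np.mean(y / p_hat)   # about 1, as it should be

p_hat[0] = 0.001                 # ONE probability estimate drifts toward zero...
est_messy = np.mean(y / p_hat)   # ...and that single unit moves the answer by ~2

print(f"stable weights:  {est_clean:.2f}")
print(f"one tiny weight: {est_messy:.2f}")
```

One out of a thousand nuisance estimates going slightly wrong triples the answer, which is exactly the "bumpy road" failure mode.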

3. The "One-Step" Kit (The Best of Both Worlds)

This is the paper's star invention. It starts with a simple guess (like the old "T-Learner") and then uses the "All-Seeing Eye" formula to fix the mistakes.

  • The Superpower: It gets the reliability of the complex method but stays stable like the simple method. It's like taking a rough sketch and using a magic eraser to clean up the lines. It's robust, stable, and works well even with smaller datasets.
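The general one-step recipe (compute a simple plug-in estimate, then add the sample average of the estimated influence-function correction) can be sketched for the ordinary average treatment effect, as a stand-in for the paper's principal-effect version:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))  # true treatment probability
treated = rng.random(n) < p
y = 2.0 * x + 2.0 * treated + rng.normal(size=n)  # true effect: +2, confounded

# Step 1: a crude plug-in guess (intercept-only outcome models: arm means).
mu1 = np.full(n, y[treated].mean())
mu0 = np.full(n, y[~treated].mean())
plug_in = np.mean(mu1 - mu0)  # biased: confounding left untouched

# Step 2: add the average of the estimated influence-function correction.
correction = treated * (y - mu1) / p - (~treated) * (y - mu0) / (1 - p)
one_step = plug_in + np.mean(correction)

print(f"plug-in:  {plug_in:.2f}")   # well above 2
print(f"one-step: {one_step:.2f}")  # near 2
```

The correction term is the "magic eraser": it measures how far the rough sketch misses on average and subtracts that miss back out.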

The Real-World Test: The "Hotspotting" Trial

To prove their tools work, the authors applied them to a real medical study called "Hotspotting."

  • The Setup: A program tried to help "super-utilizers" (people who use the ER constantly) by giving them a care manager.
  • The Mystery: The overall study said the program did nothing. But a secondary analysis showed it helped the people who actually engaged with the care manager.
  • The Question: Did the program help everyone who engaged, or only specific types of people who engaged?

The Result:
Using their new "One-Step" and "Subset" tools, the authors found:

  1. The program did help, but only for a specific group of "compliers."
  2. It worked best for women and people with a long history of hospital visits.
  3. It didn't seem to work for men or for those with a shorter history of hospital use.

Why This Matters

Before this paper, if you wanted to know who benefits from a treatment, you often had to make strong, unprovable guesses or use methods that broke easily with messy data.

This paper gives researchers a Swiss Army Knife:

  • A simple, sturdy tool (Subset) that works well if you have decent data.
  • A powerful, complex tool (EIF) for when you have massive data.
  • A hybrid tool (One-Step) that is the best all-rounder, giving you the power of the complex tool without the crash risk.

In short: They figured out how to stop looking at the "average" patient and start seeing the specific, hidden patterns of who actually benefits from an intervention, helping doctors and policymakers target their help to the people who need it most.