Prediction decomposition for causal analysis

This paper proposes a prediction decomposition framework that identifies within-unit-across-time prediction accuracy as a superior structural proxy for counterfactual treatment effects, offering a new diagnostic metric and model-selection strategy to improve causal inference using machine learning predictions.

Original authors: Ofir Reich

Published 2026-04-14 · Author reviewed

This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

The Big Problem: "The Good Predictor is a Bad Detective"

Imagine you are a detective trying to solve a mystery: Did a new fertilizer make crops grow taller?

You have a huge field of corn, but you can't measure every single plant (it's too expensive). So, you measure a small sample of plants directly. Then, you train a super-smart AI (Machine Learning) to look at satellite photos and guess the height of the other plants based on the ones you measured.

The Trap:
The AI becomes a fantastic predictor of how tall a plant is. It looks at the soil, the weather history, and the location, and it says, "This plant is 5 feet tall, and that one is 4 feet tall." It gets an A+ on its test scores.

But here is the catch: The AI is terrible at being a detective.

If you give the AI the data to see if the fertilizer worked, it might say, "No change!" Why? Because the AI learned that plants in "Rich Soil" are naturally tall, and plants in "Poor Soil" are naturally short. It learned to predict static differences (who is tall vs. who is short). It didn't learn to predict dynamic changes (what happens when you add fertilizer).

So, even though the AI is great at guessing heights, it completely misses the effect of the fertilizer.

The Paper's Solution: Breaking the Prediction into Three Parts

The author, Ofir Reich, proposes a way to break down the AI's "brain" into three distinct ingredients. Think of the AI's prediction as a smoothie made of three fruits:

  1. The "Who You Are" Fruit (Between-Unit): This is the static stuff. Where you live, your family history, your soil type. The AI is usually very good at this. It knows that a person in a wealthy neighborhood spends more money than someone in a poor neighborhood.
  2. The "Natural Drift" Fruit (Within-Unit): This is how things change naturally over time. Maybe you spent more money this month because it's your birthday, or the corn grew a bit because of the rain. This is the "noise" of life.
  3. The "Magic Effect" Fruit (Counterfactual Treatment Effect): This is the specific change caused by the intervention (the fertilizer, the cash transfer, the medicine).

The Discovery:
The paper argues that overall accuracy (the whole smoothie) is a bad way to judge if the AI will work for your experiment.

  • An AI can be 99% accurate at predicting the smoothie's taste because it nailed the "Who You Are" fruit.
  • But if it has zero of the "Magic Effect" fruit, it will fail your experiment.

The Secret Ingredient: The "Time-Travel" Test

How do we know if the AI has the "Magic Effect" fruit without actually running the experiment on the whole population?

The author suggests a clever trick using Panel Data (data from the same people/plots at two different times, like before and after).

The Analogy: The "Before and After" Mirror
Imagine you are trying to teach a robot to understand how a car accelerates when you press the gas pedal.

  • Bad Approach: You show the robot a picture of a Ferrari and a picture of a Tractor. The robot learns: "Ferraris are fast, Tractors are slow." It gets the prediction right! But if you ask, "What happens if I press the gas on the Tractor?" it has no idea, because it only learned about the difference between the cars, not the action of the gas pedal.
  • The Paper's Approach: You show the robot the same Tractor at 1:00 PM and then at 1:05 PM.
    • If the robot predicts a different speed at 1:05 PM than at 1:00 PM, it must be paying attention to things that change over time (like the engine revving after you press the gas).
    • If the robot gives the same prediction both times — just saying "It's a Tractor, it's slow" — it fails the test, because it's only paying attention to what the vehicle is, not what's happening to it.

The Metric (The "Diff-vs-Diff" Slope):
The paper proposes a specific test:

  1. Take the people who did not get the treatment (the control group).
  2. Look at how their actual outcomes changed from Time 1 to Time 2.
  3. Look at how the AI's predicted outcomes changed from Time 1 to Time 2.
  4. The Test: Do the AI's predictions move in sync with the real changes?
  • If the AI predicts the changes well: It means the AI is sensitive to the "drift" of life. The author argues this is a strong sign it will also be sensitive to the "Magic Effect" (the treatment).
  • If the AI ignores the changes: It means the AI is just memorizing who is rich and who is poor. It will fail to detect the treatment effect.
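A minimal sketch of that test on simulated control-group data. The specific regression direction (predicted changes on actual changes, read as "what fraction of real change the model sees") and all the numbers are assumptions for illustration, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical control group: actual outcomes and ML predictions at two times.
n = 500
y1 = rng.normal(50, 10, n)                    # actual outcome at Time 1
dy = rng.normal(0, 4, n)                      # actual change ("drift") by Time 2
y2 = y1 + dy
pred1 = y1 + rng.normal(0, 2, n)              # predictions track levels well...
pred2 = y1 + 0.6 * dy + rng.normal(0, 2, n)   # ...but catch only ~60% of the change

actual_diff = y2 - y1
pred_diff = pred2 - pred1

# "Diff-vs-diff" slope: OLS of predicted changes on actual changes.
slope, _ = np.polyfit(actual_diff, pred_diff, 1)
print(f"diff-vs-diff slope: {slope:.2f}")  # near 0.6: model sees ~60% of real change
```

A slope near 1 would mean the predictions move in sync with real changes; a slope near 0 would mean the model is just memorizing levels.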

Why This Matters

In the past, researchers picked the "best" AI based on which model had the highest R-squared (overall accuracy).

  • The Paper says: Stop doing that! A high R-Squared often just means the AI is good at spotting "Rich vs. Poor" (Between-Unit).
  • The New Rule: Pick the AI that is best at predicting changes over time (Within-Unit), even if its overall accuracy is slightly lower.
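The old rule and the new rule can disagree, and a toy comparison shows how. Both models and all numbers here are made up: Model A is built to be great on levels and blind to changes, Model B the reverse.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical control-group panel for scoring two candidate models.
n = 800
unit = rng.normal(0, 3, n)                  # static unit differences
dy = rng.normal(0, 1, n)                    # natural change between periods
y1, y2 = unit, unit + dy

# Model A: excellent on levels, blind to changes.
a1 = unit + rng.normal(0, 0.3, n)
a2 = unit + rng.normal(0, 0.3, n)
# Model B: noisier on levels (a persistent per-unit bias), but tracks changes.
bias = rng.normal(0, 1.5, n)
b1 = unit + bias
b2 = unit + bias + 0.8 * dy + rng.normal(0, 0.3, n)

def r2(actual, fitted):
    return 1 - np.var(actual - fitted) / np.var(actual)

overall = {"A": r2(np.r_[y1, y2], np.r_[a1, a2]),
           "B": r2(np.r_[y1, y2], np.r_[b1, b2])}
within = {"A": r2(y2 - y1, a2 - a1),
          "B": r2(y2 - y1, b2 - b1)}

best_overall = max(overall, key=overall.get)  # the old rule's pick
best_within = max(within, key=within.get)     # the paper's pick
```

On this data the old rule picks Model A (higher overall R-squared) while the new rule picks Model B, the one that actually tracks changes over time.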

The "Magic" Fix (With a Warning)

If you find an AI that is good at predicting changes (high "Within-Unit" score), the paper suggests you can use it to fix your results.

  • If the AI captures only 80% of real changes, you can mathematically "stretch" your estimate to recover the true answer.
  • The Warning: This only works if you assume that "predicting natural changes" is very similar to "predicting treatment changes." The author thinks this is usually true (e.g., if a model knows how a person's spending changes when they get a bonus, it likely knows how it changes when they get a cash transfer), but it's an assumption that needs to be checked.
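The "stretch" is an attenuation-style correction: divide the naive estimate by the fraction of real change the model captures. A sketch with made-up numbers, valid only under the transfer assumption the warning above describes:

```python
# Hypothetical numbers for illustration only.
naive_effect = 1.6   # effect estimated from ML-predicted outcomes
slope = 0.8          # diff-vs-diff slope measured on the control group

# If the model sees only 80% of real changes, the naive estimate is
# attenuated by the same factor, so divide it back out.
corrected_effect = naive_effect / slope
print(corrected_effect)  # 2.0: the "stretched" estimate of the true effect
```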

Summary in One Sentence

Don't just ask your AI, "How well can you guess the answer?" Ask it, "How well can you guess the change?" Because if it can't guess the change, it definitely can't guess the effect of your experiment.
