Causal generalized linear models via Pearson risk invariance

This paper introduces a method for identifying causal generalized linear models by leveraging Pearson risk invariance and maximum expected likelihood, enabling causal discovery from a single data environment without requiring multiple heterogeneous settings.

Alice Polinelli, Veronica Vinciotti, Ernst C. Wit

Published 2026-03-10

Imagine you are a detective trying to figure out what really causes a specific event. Maybe you want to know why a plant grows tall, why a stock price crashes, or why a person has a certain number of children.

In the world of data science, there's a big problem: Correlation is not Causation. Just because two things happen at the same time doesn't mean one caused the other. They might just be "friends" who happen to hang out together, or they might both be caused by a third, hidden factor.

For years, scientists have tried to solve this by looking at data from many different "worlds" or "environments" (like different countries, different years, or different weather conditions). They look for patterns that stay the same no matter how the world changes. If a relationship holds up in a drought, a flood, and a sunny day, it's likely a true cause.

The Problem: Getting data from many different "worlds" is hard. Often, we only have one big dataset (one environment).

The Solution: This paper introduces a new detective tool called Causal Generalized Linear Models via Pearson Risk Invariance. It's a fancy name for a clever trick that lets you find the true causes using just one dataset, provided the data follows certain mathematical rules (like counting things or yes/no answers).

Here is how it works, explained with simple analogies:

1. The "Perfectly Balanced Scale" (Pearson Risk Invariance)

Imagine you are trying to predict how heavy a suitcase will be based on what's inside it.

  • The Wrong Way: You guess based on the color of the suitcase. Maybe, in your experience, red suitcases are usually heavy. But if you go to a different airport (a different environment), red suitcases might be light. Your prediction fails.
  • The Right Way: You look at the actual contents. If you know the contents, the weight is predictable.

The authors propose a specific way to measure "prediction error" called Pearson Risk. Think of this as a special scale.

  • If you use the wrong variables (like the color of the suitcase), the scale will wobble. The error will be too high or too low depending on the situation.
  • If you use the right variables (the true causes), the scale becomes perfectly balanced. The "wobble" (the error) matches a specific, known standard perfectly, no matter how you shuffle the data around.

The paper proves that only the true causal model makes this scale perfectly balanced. All other models (even very good predictive ones) will make the scale wobble.
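To make the "balanced scale" concrete, here is a small numerical sketch (not the authors' code; the variable names and the simple IRLS fitter are my own, and the data are simulated). It fits a Poisson regression on the true cause and on a correlated non-cause, then compares the mean squared Pearson residual to its known target of 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x1 = rng.normal(size=n)             # the true cause
x2 = x1 + rng.normal(size=n)        # correlated with x1, but NOT a cause of y
y = rng.poisson(np.exp(0.5 + 0.8 * x1))

def fit_poisson(x, y, iters=50):
    """Poisson regression (log link) via IRLS; returns fitted means."""
    A = np.column_stack([np.ones(len(y)), x])
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        mu = np.exp(A @ beta)
        z = A @ beta + (y - mu) / mu          # working response
        beta = np.linalg.solve(A.T @ (mu[:, None] * A), A.T @ (mu * z))
    return np.exp(A @ beta)

def pearson_risk(y, mu):
    """Mean squared Pearson residual; equals 1 under the true Poisson model."""
    return np.mean((y - mu) ** 2 / mu)

mu_causal = fit_poisson(x1, y)
mu_wrong = fit_poisson(x2, y)
print(pearson_risk(y, mu_causal))   # close to 1: the scale is balanced
print(pearson_risk(y, mu_wrong))    # noticeably above 1: the scale wobbles
```

The model built on the true cause lands on the known standard (a Pearson risk of 1, the dispersion of a Poisson model), while the model built on the correlated impostor overshoots it, even though the impostor is a decent predictor.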

2. The "Goldilocks" Search (Maximizing Likelihood)

Finding the right variables is like finding the perfect key for a lock.

  • The method first looks for keys that fit the lock well (maximizing the "likelihood," or how well the model explains the data we have).
  • Then, it checks if that key makes the "Pearson Scale" perfectly balanced.
  • If a key fits the data and balances the scale, Bingo! You found a causal parent.

3. The "One-Environment" Magic Trick

Usually, to prove something is a cause, you need to see it change in many different environments.

  • The Old Way: You need data from 10 different cities to prove that rain causes wet grass.
  • The New Way: If you are counting things (like the number of emails you get, which follows a Poisson distribution) or dealing with yes/no outcomes (modeled with logistic regression), the math is so strict that the "Perfectly Balanced Scale" only works for the true causes.
  • The Result: You don't need 10 cities. You can find the true causes with data from just one city, as long as you know the "rules of the game" (the dispersion parameter).
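The "strictness" of these distributions is what makes the one-environment trick possible: for Poisson and yes/no (Bernoulli) data, the variance is completely pinned down by the mean, so the dispersion parameter is known to be 1 and there is no leftover wobble for a wrong model to hide in. A quick check (illustrative simulation, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Poisson: the variance must equal the mean.
y_pois = rng.poisson(3.0, size=100_000)
print(y_pois.mean(), y_pois.var())    # both ≈ 3.0

# Bernoulli (yes/no): the variance must equal p * (1 - p).
p = 0.3
y_bern = rng.binomial(1, p, size=100_000)
print(y_bern.mean(), y_bern.var())    # ≈ 0.3 and ≈ 0.3 * 0.7 = 0.21
```

Contrast this with Gaussian data, where mean and variance are separate dials: there, the variance gives you no free consistency check, which is why those models still need multiple environments.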

4. The "Stepwise Detective" (The Algorithm)

Imagine you have 100 suspects (variables). Checking every possible combination of suspects to see who is guilty would take a lifetime (checking $2^{100}$ combinations).

  • The authors propose a Stepwise Search.
  • Phase 1 (Adding): Start with an empty room. Add one suspect at a time. If adding a suspect makes the "Pearson Scale" wobble less (or stay balanced), keep them.
  • Phase 2 (Removing): Once you have a group, try removing one suspect at a time. If removing them makes the scale wobble, put them back. If the scale stays balanced without them, kick them out.
  • This is much faster than checking every single combination, making it practical for real-world problems.
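The two phases above can be sketched in a few lines. This is a simplified stand-in for the paper's procedure, not the authors' implementation: the greedy rule, the `tol` threshold, and the simulated data are my own choices, and the "wobble" is the distance of the Pearson statistic from its known value of 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
X = rng.normal(size=(n, 4))
X[:, 2] = X[:, 0] + rng.normal(size=n)            # suspect 2 is a correlated impostor
y = rng.poisson(np.exp(0.3 + 0.6 * X[:, 0] - 0.5 * X[:, 1]))  # true parents: 0 and 1

def fit_poisson(A, y, iters=50):
    """Poisson regression (log link) via IRLS; returns fitted means."""
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        mu = np.exp(A @ beta)
        z = A @ beta + (y - mu) / mu
        beta = np.linalg.solve(A.T @ (mu[:, None] * A), A.T @ (mu * z))
    return np.exp(A @ beta)

def wobble(cols):
    """|mean squared Pearson residual - 1| for the model using `cols`."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    mu = fit_poisson(A, y)
    return abs(np.mean((y - mu) ** 2 / mu) - 1)

tol = 0.05                    # how close to "perfectly balanced" we demand (assumed)

# Phase 1 (Adding): greedily add the suspect that best reduces the wobble.
selected, rest = [], list(range(X.shape[1]))
while rest and wobble(selected) > tol:
    best = min(rest, key=lambda j: wobble(selected + [j]))
    if wobble(selected + [best]) < wobble(selected):
        selected.append(best)
        rest.remove(best)
    else:
        break

# Phase 2 (Removing): kick out any suspect the scale stays balanced without.
for j in list(selected):
    reduced = [k for k in selected if k != j]
    if wobble(reduced) <= tol:
        selected = reduced

print(sorted(selected))       # recovers the true parents, columns 0 and 1
```

Note how the impostor in column 2 never makes the cut: once the true cause (column 0) is in the room, adding its correlated shadow no longer reduces the wobble, which is exactly the invariance argument at work.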

Real-World Examples from the Paper

The authors tested this on two real-life mysteries:

  1. Women's Fertility: They looked at data on how many children women have.

    • The Result: They found that education level and age have a direct causal effect. Interestingly, the effect of education wasn't a straight line; it was a curve. As education goes up, fertility drops sharply. This method found the shape of that relationship, not just a simple "more education = fewer kids."
  2. High Income: They looked at what causes people to earn over $50,000 a year.

    • The Result: They identified age, education, marital status, and job type as the true drivers. For example, being married made someone roughly 7 times more likely to be a high earner compared to other statuses.

The Bottom Line

This paper gives us a new, powerful magnifying glass. It allows us to separate true causes from lucky coincidences using just a single dataset, provided the data is the right type (counts or yes/no).

Instead of needing a time machine to see how things change in different worlds, this method uses a mathematical "balance scale" to tell us which variables are the real architects of our reality. It's like finding the true recipe for a cake by tasting just one slice, rather than baking the cake in 50 different kitchens.