Causal generalized linear models via Pearson risk invariance

This paper introduces a method for identifying causal generalized linear models by leveraging Pearson risk invariance and maximum expected likelihood, enabling causal discovery from a single data environment without requiring multiple heterogeneous settings.

Alice Polinelli, Veronica Vinciotti, Ernst C. Wit

Published 2026-03-10

Imagine you are a detective trying to figure out what really causes a specific event. Maybe you want to know why a plant grows tall, why a stock price crashes, or why a person has a certain number of children.

In the world of data science, there's a big problem: Correlation is not Causation. Just because two things happen at the same time doesn't mean one caused the other. They might just be "friends" who happen to hang out together, or they might both be caused by a third, hidden factor.

For years, scientists have tried to solve this by looking at data from many different "worlds" or "environments" (like different countries, different years, or different weather conditions). They look for patterns that stay the same no matter how the world changes. If a relationship holds up in a drought, a flood, and a sunny day, it's likely a true cause.

The Problem: Getting data from many different "worlds" is hard. Often, we only have one big dataset (one environment).

The Solution: This paper introduces a new detective tool called Causal Generalized Linear Models via Pearson Risk Invariance. It's a fancy name for a clever trick that lets you find the true causes using just one dataset, provided the data follows certain mathematical rules (like counting things or yes/no answers).

Here is how it works, explained with simple analogies:

1. The "Perfectly Balanced Scale" (Pearson Risk Invariance)

Imagine you are trying to predict how heavy a suitcase will be based on what's inside it.

  • The Wrong Way: You guess based on the color of the suitcase. Maybe, in your experience, red suitcases are usually heavy. But if you go to a different airport (a different environment), red suitcases might be light. Your prediction fails.
  • The Right Way: You look at the actual contents. If you know the contents, the weight is predictable.

The authors propose a specific way to measure "prediction error" called Pearson Risk. Think of this as a special scale.

  • If you use the wrong variables (like the color of the suitcase), the scale will wobble. The error will be too high or too low depending on the situation.
  • If you use the right variables (the true causes), the scale becomes perfectly balanced. The "wobble" (the error) matches a specific, known standard perfectly, no matter how you shuffle the data around.

The paper proves that only the true causal model makes this scale perfectly balanced. All other models (even very good predictive ones) will make the scale wobble.
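To make the "balanced scale" concrete, here is a small numerical sketch (not the authors' code; the variable names and the simple IRLS fitter are my own, and the data are simulated). It fits a Poisson regression on the true cause and on a correlated non-cause, then compares the mean squared Pearson residual to its known target of 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x1 = rng.normal(size=n)             # the true cause
x2 = x1 + rng.normal(size=n)        # correlated with x1, but NOT a cause of y
y = rng.poisson(np.exp(0.5 + 0.8 * x1))

def fit_poisson(x, y, iters=50):
    """Poisson regression (log link) via IRLS; returns fitted means."""
    A = np.column_stack([np.ones(len(y)), x])
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        mu = np.exp(A @ beta)
        z = A @ beta + (y - mu) / mu          # working response
        beta = np.linalg.solve(A.T @ (mu[:, None] * A), A.T @ (mu * z))
    return np.exp(A @ beta)

def pearson_risk(y, mu):
    """Mean squared Pearson residual; equals 1 under the true Poisson model."""
    return np.mean((y - mu) ** 2 / mu)

mu_causal = fit_poisson(x1, y)
mu_wrong = fit_poisson(x2, y)
print(pearson_risk(y, mu_causal))   # close to 1: the scale is balanced
print(pearson_risk(y, mu_wrong))    # noticeably above 1: the scale wobbles
```

The model built on the true cause lands on the known standard (a Pearson risk of 1, the dispersion of a Poisson model), while the model built on the correlated impostor overshoots it, even though the impostor is a decent predictor.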

2. The "Goldilocks" Search (Maximizing Likelihood)

Finding the right variables is like finding the perfect key for a lock.

  • The method first looks for keys that fit the lock well (maximizing the "likelihood," or how well the model explains the data we have).
  • Then, it checks if that key makes the "Pearson Scale" perfectly balanced.
  • If a key fits the data and balances the scale, Bingo! You found a causal parent.

3. The "One-Environment" Magic Trick

Usually, to prove something is a cause, you need to see it change in many different environments.

  • The Old Way: You need data from 10 different cities to prove that rain causes wet grass.
  • The New Way: If you are counting things (like the number of emails you get, which follows a Poisson distribution) or dealing with yes/no outcomes (modeled with logistic regression), the math is so strict that the "Perfectly Balanced Scale" only works for the true causes.
  • The Result: You don't need 10 cities. You can find the true causes with data from just one city, as long as you know the "rules of the game" (the dispersion parameter).
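The "strictness" of these distributions is what makes the one-environment trick possible: for Poisson and yes/no (Bernoulli) data, the variance is completely pinned down by the mean, so the dispersion parameter is known to be 1 and there is no leftover wobble for a wrong model to hide in. A quick check (illustrative simulation, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Poisson: the variance must equal the mean.
y_pois = rng.poisson(3.0, size=100_000)
print(y_pois.mean(), y_pois.var())    # both ≈ 3.0

# Bernoulli (yes/no): the variance must equal p * (1 - p).
p = 0.3
y_bern = rng.binomial(1, p, size=100_000)
print(y_bern.mean(), y_bern.var())    # ≈ 0.3 and ≈ 0.3 * 0.7 = 0.21
```

Contrast this with Gaussian data, where mean and variance are separate dials: there, the variance gives you no free consistency check, which is why those models still need multiple environments.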

4. The "Stepwise Detective" (The Algorithm)

Imagine you have 100 suspects (variables). Checking every possible combination of suspects to see who is guilty would take a lifetime (checking $2^{100}$ combinations).

  • The authors propose a Stepwise Search.
  • Phase 1 (Adding): Start with an empty room. Add one suspect at a time. If adding a suspect makes the "Pearson Scale" wobble less (or stay balanced), keep them.
  • Phase 2 (Removing): Once you have a group, try removing one suspect at a time. If removing them makes the scale wobble, put them back. If the scale stays balanced without them, kick them out.
  • This is much faster than checking every single combination, making it practical for real-world problems.
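The two phases above can be sketched in a few lines. This is a simplified stand-in for the paper's procedure, not the authors' implementation: the greedy rule, the `tol` threshold, and the simulated data are my own choices, and the "wobble" is the distance of the Pearson statistic from its known value of 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
X = rng.normal(size=(n, 4))
X[:, 2] = X[:, 0] + rng.normal(size=n)            # suspect 2 is a correlated impostor
y = rng.poisson(np.exp(0.3 + 0.6 * X[:, 0] - 0.5 * X[:, 1]))  # true parents: 0 and 1

def fit_poisson(A, y, iters=50):
    """Poisson regression (log link) via IRLS; returns fitted means."""
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        mu = np.exp(A @ beta)
        z = A @ beta + (y - mu) / mu
        beta = np.linalg.solve(A.T @ (mu[:, None] * A), A.T @ (mu * z))
    return np.exp(A @ beta)

def wobble(cols):
    """|mean squared Pearson residual - 1| for the model using `cols`."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    mu = fit_poisson(A, y)
    return abs(np.mean((y - mu) ** 2 / mu) - 1)

tol = 0.05                    # how close to "perfectly balanced" we demand (assumed)

# Phase 1 (Adding): greedily add the suspect that best reduces the wobble.
selected, rest = [], list(range(X.shape[1]))
while rest and wobble(selected) > tol:
    best = min(rest, key=lambda j: wobble(selected + [j]))
    if wobble(selected + [best]) < wobble(selected):
        selected.append(best)
        rest.remove(best)
    else:
        break

# Phase 2 (Removing): kick out any suspect the scale stays balanced without.
for j in list(selected):
    reduced = [k for k in selected if k != j]
    if wobble(reduced) <= tol:
        selected = reduced

print(sorted(selected))       # recovers the true parents, columns 0 and 1
```

Note how the impostor in column 2 never makes the cut: once the true cause (column 0) is in the room, adding its correlated shadow no longer reduces the wobble, which is exactly the invariance argument at work.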

Real-World Examples from the Paper

The authors tested this on two real-life mysteries:

  1. Women's Fertility: They looked at data on how many children women have.

    • The Result: They found that education level and age have a direct causal effect. Interestingly, the effect of education wasn't a straight line; it was a curve. As education goes up, fertility drops sharply. This method found the shape of that relationship, not just a simple "more education = fewer kids."
  2. High Income: They looked at what causes people to earn over $50,000 a year.

    • The Result: They identified age, education, marital status, and job type as the true drivers. For example, being married made someone roughly 7 times more likely to be a high earner compared to other statuses.

The Bottom Line

This paper gives us a new, powerful magnifying glass. It allows us to separate true causes from lucky coincidences using just a single dataset, provided the data is the right type (counts or yes/no).

Instead of needing a time machine to see how things change in different worlds, this method uses a mathematical "balance scale" to tell us which variables are the real architects of our reality. It's like finding the true recipe for a cake by tasting just one slice, rather than baking the cake in 50 different kitchens.