Verifying the existence of maximum likelihood estimates for generalized linear models

Imagine you are a detective trying to solve a mystery using a very sophisticated computer program. Your goal is to find the "perfect recipe" (a set of numbers) that explains a pattern in your data. In the world of economics, this is called finding the Maximum Likelihood Estimate (MLE). It's like finding the exact combination of ingredients that makes a cake taste exactly like the one you're trying to copy.

For a long time, many economists took it for granted that this computer program would always find a perfect recipe. But this paper reveals a hidden trap: sometimes, the perfect recipe doesn't exist.

Here is the breakdown of the problem and the solution, using simple analogies.

1. The Problem: The "Impossible Cake" (Separation)

Imagine you are trying to predict whether a customer will buy a product (Yes/No) or how much they will spend (Count data). You have a list of clues (variables) like age, income, and location.

The Trap:
Sometimes, your clues are too perfect. Imagine you have a rule: "If a customer is from Country A, they never buy anything."

In your data, every single person from Country A has a purchase count of zero.
Every single person from Country B has a purchase count of one or more.

When you ask the computer to find the "perfect recipe," it gets stuck. To make the prediction for Country A exactly zero, the computer keeps pushing the "Country A" ingredient in the recipe toward negative infinity. Every step in that direction makes the fit a tiny bit better, so there is never a point where it can stop and say the recipe is finished.

The computer keeps chasing a number that is "infinity" and never gets there. In math terms, the estimate does not exist. This is called Separation. It's like trying to balance a pencil on its tip; no matter how hard you try, it falls over because the perfect balance point is physically impossible to hold.

Why is this a big deal?

It's common: It happens often in trade data (e.g., two countries that have never traded before) or health data (e.g., a specific treatment that always results in zero cost).
It's hidden: The computer might not crash; it might just give you a weird, huge number and say, "I'm done!" You might think, "Oh, that's a real result," but it's actually a mathematical illusion.
It's worse with big data: Modern economics uses massive datasets with thousands of "fixed effects" (like specific years, specific cities, specific companies). The more complex the data, the easier it is to accidentally create these "impossible" scenarios.

2. The Old Solutions (and why they suck)

Before this paper, if a computer got stuck on this "impossible cake," researchers had two bad options:

- Throw away a clue: "Okay, let's just ignore the 'Country' variable."
  - The Problem: This changes the whole recipe. You might lose important information about other variables. It's like fixing a broken car by removing the engine; the car stops making noise, but it also doesn't drive anymore.
- Add a "penalty": Force the computer to stop at a reasonable number, even if it's not perfect.
  - The Problem: This changes the rules of the game. You aren't finding the true maximum anymore; you're finding a "compromise" maximum. It's like forcing the pencil to stay upright by gluing it to the table. It works, but it's not the real solution.

3. The New Solution: The "Iterative Rectifier"

The authors of this paper (Correia, Guimarães, and Zylkin) found a clever, third way.

The Insight:
They realized that the "impossible" observations (the ones causing the infinity problem) are actually perfectly predictable.

If the computer knows for a fact that "Country A" always equals zero, it doesn't need to do any math to figure that out. It's already solved.
These "perfectly predicted" observations carry no information for the rest of the calculation. They are like a student in a math class who already knows the answer to every question; they don't help the teacher figure out how to teach the other students.

The Fix:

1. Identify the "Perfect" Observations: Use a new, fast algorithm (called the Iterative Rectifier) to find the specific data points that are causing the "infinity" problem.
   - Analogy: Imagine a sieve that instantly filters out the rocks that are too big to fit in the bucket, leaving only the sand.
2. Remove Them Temporarily: Take those specific "perfect" observations out of the dataset.
3. Run the Math: Now, run the computer program on the remaining data. Because you removed the "infinity" triggers, the computer finds a perfect, finite recipe for everything else.

The Magic: The recipe you get for the remaining data is exactly the same as the recipe you would have gotten if you could have solved the impossible problem. The "perfect" observations didn't change the answer for the others; they just broke the calculator.

4. Why This Matters for Everyone

It's Fast: The old way to find these "impossible" points required solving a massive, slow puzzle (Linear Programming). The new method is like using a high-speed scanner. It scales to datasets with millions of data points.
It's Safe: You don't have to guess which variable to throw away. The computer tells you exactly which data points are the troublemakers.
It Saves Research: Many economic studies (like trade agreements or health costs) might have been using "broken" numbers without knowing it. This paper gives researchers a tool to clean their data and get the right answers.

Summary Analogy

Imagine you are trying to find the center of a crowd of people.

The Problem: A few people are standing on a cliff edge, and the rest are in a valley. If you try to find the "average" spot, the cliff people pull the average so far up that it doesn't exist on the map.
The Old Way: You either ignore the cliff people (losing their story) or force the average to stay in the valley (lying about the math).
The New Way: You quickly spot the people on the cliff, realize they are in a different "zone," and set them aside. You then find the perfect center of the people in the valley. You know exactly where the cliff people are, and you know your calculation for the valley is 100% accurate.

This paper gives economists the "spotter" to find those cliff-edge data points and the "calculator" to solve the rest of the puzzle correctly.

Here is a detailed technical summary of the paper "Verifying the existence of maximum likelihood estimates for generalized linear models" by Correia, Guimarães, and Zylkin.

1. The Problem: Nonexistence of MLEs in GLMs

The paper addresses a fundamental issue in nonlinear econometrics: maximum likelihood estimates (MLEs) are not guaranteed to exist for Generalized Linear Models (GLMs). While this problem, known as "separation," is well-documented in binary response models (e.g., Logit/Probit), it remains under-recognized and poorly understood in broader contexts, particularly:

Non-binary outcomes: Specifically Poisson regression and other count data models widely used in economics (e.g., trade flows, patent citations).
High-dimensional settings: Models featuring multiple levels of fixed effects (e.g., exporter-time, importer-time, and pair fixed effects in gravity models).
Pseudo-Maximum Likelihood (PML): The issue extends to PML estimators, which are often used when the true data distribution is unknown or when handling zero-inflated data.

When separation occurs, the likelihood function increases indefinitely as parameters approach infinity, leading to non-convergence or numerical instability in estimation algorithms. Standard software often fails to detect this, producing spurious results or arbitrary estimates depending on convergence tolerances.
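A small numerical sketch can make the "increases indefinitely" claim concrete. The toy data, the variable names, and the choice of a Poisson model with a single dummy are assumptions made here for illustration, not taken from the paper. The log-likelihood keeps rising as the dummy's coefficient `delta` becomes more negative, so no finite maximizer exists:

```python
import numpy as np

# Hypothetical toy data: d = 1 marks a group whose outcomes are all zero.
y = np.array([0, 0, 0, 2, 1, 4], dtype=float)
d = np.array([1, 1, 1, 0, 0, 0], dtype=float)

def poisson_loglik(b0, delta):
    """Poisson log-likelihood (without the log(y!) constant), mean exp(b0 + delta*d)."""
    eta = b0 + delta * d
    return float(np.sum(y * eta - np.exp(eta)))

# Intercept that fits the d = 0 group; only delta is varied below.
b0 = np.log(y[d == 0].mean())
deltas = [0.0, -5.0, -10.0, -20.0]
logliks = [poisson_loglik(b0, dl) for dl in deltas]

for dl, ll in zip(deltas, logliks):
    print(f"delta = {dl:6.1f}   log-likelihood = {ll:.10f}")
```

Each successive value is larger than the last, but the gains shrink quickly. This is why an optimizer with a loose tolerance may stop at some large negative `delta` and report it as if it were an estimate.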

2. Methodology and Theoretical Framework

A. Theoretical Conditions for Existence

The authors formalize the conditions under which MLEs exist for a broad class of GLMs defined by the exponential family log-likelihood:
$l(\beta) = \sum_i [\alpha_i(\phi) y_i \theta_i - \alpha_i(\phi) b(\theta_i) + c(y_i, \phi)]$

Proposition 1 (General GLMs):
For models where the individual likelihood contribution has a finite upper bound (e.g., Poisson, Logit, Probit), an MLE exists if and only if there is no linear combination of regressors $z_i = x_i \gamma^*$ that "separates" the data. Separation is defined by the existence of a non-zero vector $\gamma^*$ such that:

$z_i = 0$ for all observations where $0 < y_i < \bar{y}$ (interior observations).
$z_i \geq 0$ for all observations where $y_i = \bar{y}$ (upper boundary).
$z_i \leq 0$ for all observations where $y_i = 0$ (lower boundary).

If such a vector exists, the likelihood can be increased indefinitely by moving in the direction of $\gamma^*$.
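The three conditions are easy to check for a candidate vector. The helper below is a minimal sketch: `is_separating`, its `tol` argument, and the toy data are hypothetical names introduced here, and the requirement that $z_i$ be nonzero for at least one observation is this sketch's reading of "non-zero vector". It verifies a given $\gamma$ only; it does not search for one.

```python
import numpy as np

def is_separating(X, y, gamma, y_bar=None, tol=1e-10):
    """Check whether z = X @ gamma satisfies the three separation conditions.

    y_bar is the upper bound of the outcome (e.g. 1 for a binary model);
    leave it as None for models such as Poisson with no upper bound.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    z = X @ np.asarray(gamma, dtype=float)
    lower = y == 0
    upper = (y == y_bar) if y_bar is not None else np.zeros_like(lower)
    interior = ~(lower | upper)
    ok = (np.all(np.abs(z[interior]) <= tol)   # z = 0 on interior observations
          and np.all(z[upper] >= -tol)         # z >= 0 on the upper boundary
          and np.all(z[lower] <= tol))         # z <= 0 on the lower boundary
    return bool(ok and np.any(np.abs(z) > tol))

# Hypothetical Poisson-type data with columns: constant, x1, x2.
X = np.array([[1, 1, 1], [1, 2, 2], [1, 3, 3],    # y > 0: x1 equals x2
              [1, 1, 2], [1, 0, 3], [1, 2, 2]])   # y = 0
y = np.array([3, 1, 2, 0, 0, 0])

print(is_separating(X, y, [0, 1, -1]))  # the combination x1 - x2
print(is_separating(X, y, [0, 1, 0]))   # x1 on its own
print(is_separating(X, y, [0, 0, 1]))   # x2 on its own
```

In this example neither regressor separates the data on its own, but the combination $x_1 - x_2$ does. A check that inspects regressors one at a time would miss it.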

Proposition 2 (Gamma and Inverse Gaussian PML):
The authors derive stricter conditions for Gamma and Inverse Gaussian PML estimators (often used with zero-inflated data). Unlike Poisson, these estimators have likelihood contributions that are unbounded from above when $y_i=0$ and the linear predictor goes to $-\infty$. Consequently, the conditions for existence are more restrictive; even if "overlap" exists (as defined in Poisson models), finite solutions may still not exist for these specific estimators.
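To see where the unboundedness comes from, it helps to write out the per-observation contributions. The lines below assume a log link, $\mu_i = \exp(x_i \beta)$, and drop the scale factor and any terms that do not depend on $\beta$; both are simplifications made here for illustration.

Poisson:
$l_i = y_i x_i \beta - \exp(x_i \beta)$, so $y_i = 0$ gives $l_i = -\exp(x_i \beta) \to 0$ as $x_i \beta \to -\infty$.

Gamma PML:
$l_i = -y_i \exp(-x_i \beta) - x_i \beta$, so $y_i = 0$ gives $l_i = -x_i \beta \to +\infty$ as $x_i \beta \to -\infty$.

Inverse Gaussian PML:
$l_i = -\tfrac{1}{2} y_i \exp(-2 x_i \beta) + \exp(-x_i \beta)$, so $y_i = 0$ gives $l_i = \exp(-x_i \beta) \to +\infty$ as $x_i \beta \to -\infty$.

Under these assumptions a zero outcome can add at most $0$ to the Poisson objective, while in the Gamma and Inverse Gaussian cases it rewards ever more negative linear predictors without limit.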

B. Theoretical Implications of Separation

The paper establishes that separation is not merely a computational nuisance but a structural property.

Compactified Parameter Space: Once the parameter space is extended to include its boundaries ($\pm \infty$), a solution always exists.
Consistency of Subsets: Crucially, Proposition 3 demonstrates that even when separation occurs, the estimates for non-separated linear parameters (those orthogonal to the separating vector) remain consistent and uniquely identified.
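Equivalence to Perfect Collinearity: Once separated observations are removed, the separating combination $z_i$ equals zero for every remaining observation, so the regressors involved are perfectly collinear in the remaining sample and the problem can be handled like any other case of perfect collinearity. The separated observations provide no information about the finite parameters, and their removal yields the same fit and inference as the "limiting conditional model."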
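The claim that dropping separated observations leaves the other parameters unchanged can be illustrated numerically. This is a sketch under stated assumptions: the data are made up, `fit_poisson` is a bare-bones Newton routine written for this example (not the authors' estimator), and the value $-40$ stands in for "a coefficient heading to $-\infty$".

```python
import numpy as np

def fit_poisson(X, y, n_iter=50):
    """Poisson regression with log link via Newton's method (column 0 = constant)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        beta = beta + np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    return beta

def loglik(X, y, beta):
    eta = X @ beta
    return float(np.sum(y * eta - np.exp(eta)))

# Hypothetical data: d = 1 marks a group with only zero outcomes (separated).
x = np.array([0.5, 1.5, 2.5, 0.0, 1.0, 2.0, 3.0])
d = np.array([1, 1, 1, 0, 0, 0, 0], dtype=float)
y = np.array([0, 0, 0, 1, 2, 4, 7], dtype=float)
X_full = np.column_stack([np.ones_like(x), x, d])

# Fit on the sample that excludes the separated observations (and the dummy).
keep = d == 0
X_sub, y_sub = X_full[keep][:, :2], y[keep]
beta_sub = fit_poisson(X_sub, y_sub)

# Plug the sub-sample estimates into the full model with a very negative dummy coefficient.
beta_limit = np.append(beta_sub, -40.0)
ll_gap = loglik(X_sub, y_sub, beta_sub) - loglik(X_full, y, beta_limit)
score = X_full.T @ (y - np.exp(X_full @ beta_limit))

print("sub-sample estimates:", beta_sub)
print("log-likelihood gap:  ", ll_gap)
print("full-sample score:   ", score)
```

The full-sample score is numerically zero at the sub-sample estimates and the two log-likelihoods coincide, which is what one would expect if the separated observations contribute nothing to the finite parameters.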

C. Algorithmic Solution: The Iterative Rectifier (IR)

To detect separation in high-dimensional settings (where standard Linear Programming is computationally infeasible due to the "curse of dimensionality"), the authors propose a novel algorithm called the Iterative Rectifier (IR).

Mechanism: The algorithm solves a weighted least squares problem iteratively.
1. Define an artificial dependent variable $u_i$ (e.g., $u_i = -1$ if $y_i = 0$ and $u_i = 0$ if $y_i > 0$).
2. Assign weights $\omega_i$ such that observations with $y_i > 0$ receive a very large weight $K$.
3. Regress $u_i$ on $x_i$ by weighted least squares and compute the fitted values $\hat{u}_i$.
4. Update $u_i$ for zero-outcome observations to be $\min(\hat{u}_i, 0)$.
5. Repeat until convergence.
Detection: If the algorithm converges with predicted values $\hat{u}_i < 0$ for some observations, those observations are separated.
Efficiency: Leveraging recent innovations in high-dimensional fixed effects estimation (Correia, 2017), this method runs in nearly linear time, making it scalable to datasets with millions of observations and thousands of fixed effects.
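The steps above can be sketched in a few lines of NumPy. This is an illustrative dense-matrix version, not the authors' implementation: the function name, the rule tying $K$ to the tolerance `eps`, the unit weight on zero-outcome observations, the residual-based stopping rule, and the extra threshold `tol` used to decide that a fitted value is "negative" are all assumptions of this sketch. It also uses an ordinary least squares solver rather than the fixed-effects machinery that gives the method its speed.

```python
import numpy as np

def iterative_rectifier(X, y, eps=1e-8, tol=1e-6, max_iter=1000):
    """Flag separated zero-outcome observations for a Poisson-type model.

    eps: convergence tolerance on the regression residuals.
    tol: how far below zero a fitted value must be to count as negative
         (added here to guard against floating-point noise).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    zero = y == 0
    K = np.ceil(1.0 / np.sqrt(eps))       # large weight for y > 0 (assumed rule)
    w = np.where(zero, 1.0, K)            # unit weight for y = 0 (assumed)
    u = np.where(zero, -1.0, 0.0)         # step 1: artificial dependent variable
    sw = np.sqrt(w)
    for _ in range(max_iter):
        # Step 3: weighted least squares of u on X.
        gamma, *_ = np.linalg.lstsq(X * sw[:, None], u * sw, rcond=None)
        u_hat = X @ gamma
        # Step 5: stop once every residual is negligible.
        if np.max(np.abs(u - u_hat)) < eps:
            break
        # Step 4: rectify the zero-outcome targets so they stay <= 0.
        u[zero] = np.minimum(u_hat[zero], 0.0)
    return zero & (u_hat < -tol), gamma

# Hypothetical data: the dummy equals 1 only for observations with y = 0.
X = np.column_stack([np.ones(6), [1, 1, 1, 0, 0, 0]])
y = np.array([0, 0, 0, 2, 0, 4])
separated, gamma = iterative_rectifier(X, y)
print(separated)
```

In this example the first three observations are flagged. The zero in the second group is not, because that group also contains positive outcomes and its fitted value is driven to zero by the heavy weight on them.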

3. Key Contributions

Unified Theory of Separation: The paper unifies the understanding of separation across binary, count, and continuous GLMs, clarifying that the phenomenon is a general property of exponential family models, not just binary choice.
Differentiation of Estimators: It highlights that Gamma PML and Inverse Gaussian PML suffer from more severe nonexistence issues than Poisson or Logit, particularly in the presence of zero outcomes, advising caution in their application to trade and health data.
Remedy via Observation Withholding: The authors provide a rigorous theoretical justification for withholding separated observations from the estimation sample. They prove this yields consistent estimates for the remaining parameters and valid inference, avoiding the need for penalized likelihoods (which alter the objective function and are incompatible with high-dimensional fixed effects).
Scalable Detection Algorithm: The introduction of the Iterative Rectifier (IR) algorithm solves the practical bottleneck of detecting separation in high-dimensional fixed effects models, a task previously considered infeasible.

4. Empirical Results

The authors apply their methods to a gravity model of trade (based on Baier et al., 2019) involving:

Data: Bilateral trade flows between 69 countries (1986–2006).
Complexity: ~2,800 pair fixed effects, ~2,200 time fixed effects, and 910 Free Trade Agreement (FTA) indicators.
Findings:
- Standard estimation without separation checks produced a massive, spurious coefficient for the Iceland-Romania FTA (due to zero trade prior to the agreement).
- The Iterative Rectifier correctly identified 7 separated observations (pre-FTA Iceland-Romania) and 42 other perfectly predicted zero-trade pairs.
- Comparison: Standard checks (like those in Stata's ppml) failed to detect the separation because they rely on checking individual regressors rather than linear combinations. The IR method successfully isolated the separated data, allowing for consistent estimation of all other FTA coefficients and fixed effects.

5. Significance and Conclusion

This paper resolves significant ambiguity in applied econometrics regarding the reliability of GLM estimates in high-dimensional settings.

Practical Impact: It provides researchers with a computationally feasible tool (ppmlhdfe with sep(ir)) to verify the existence of estimates before drawing conclusions.
Theoretical Clarity: It shifts the paradigm from "estimates do not exist" to "estimates exist for a subset of parameters if separated observations are handled correctly."
Policy Relevance: In fields like international trade and health economics, where zero-inflated data and high-dimensional fixed effects are standard, this work helps researchers avoid reporting spurious or non-converged results, so that policy inferences drawn from these models rest on estimates that actually exist.

The authors conclude that while separation is a persistent challenge, it is manageable. By detecting and removing separated observations, researchers can recover consistent estimates for the estimable parameters without resorting to ad-hoc penalization or dropping arbitrary regressors.