Identification of Latent Group Effects under Conditional Calibration

This paper establishes that a structural group effect is point-identified from observables including a calibrated probability score via a simple weighted moment ratio, provided the score is not a deterministic function of covariates, and derives the asymptotic properties and bias bounds of the resulting estimator under calibration errors and classification thresholds.

Marcell T. Kurbucz

Published 2026-04-13

Imagine you are a detective trying to solve a mystery: How much does a specific hidden trait change a person's life outcome?

Let's say you want to know how much being "fuel insecure" (unable to afford heating) affects a person's health. The problem? You don't have a list of who is fuel insecure. That data is missing.

However, you do have a very smart computer algorithm that looks at a person's income, location, and job, and gives them a probability score (let's call it a "Risk Score").

  • If the score is 0.9, the algorithm is 90% sure they are fuel insecure.
  • If the score is 0.1, it's 90% sure they are not.
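This property, that the score means exactly what it says, is what the title calls calibration. In symbols (our notation, not necessarily the paper's), with G the hidden group indicator, S the score, and X the background covariates:

```latex
% G = hidden group indicator, S = probability score, X = covariates
\mathbb{E}[G \mid S, X] = S \qquad \text{(conditional calibration)}
```

That is: among everyone with score 0.9 and the same background, 90% really do belong to the hidden group.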

The big question this paper answers is: Can we use these fuzzy probability scores to figure out the exact truth about the hidden group, even though we never see the group itself?

Here is the breakdown of the paper's findings, explained with simple analogies.

1. The Magic Formula (The "Detective's Trick")

The author, Marcell Kurbucz, discovered a mathematical "magic trick" (a formula) that lets you calculate the true effect of the hidden group using only the scores and the outcomes.

The Analogy: The Noisy Radio
Imagine the hidden group status (Fuel Insecure vs. Not) is a radio station playing a song. You can't hear the station directly. But you have a radio tuner (the probability score) that picks up the signal.

  • Sometimes the tuner is perfect.
  • Sometimes the tuner is staticky (noisy).

The paper proves that as long as the tuner has some static (some randomness that isn't just based on the person's background), you can actually tune out the noise and hear the song clearly.

The formula is essentially:

True Effect = (how much the score and the outcome move together) ÷ (how much the score moves on its own)

In statistical terms, it is a covariance-to-variance ratio, computed after accounting for what the background covariates already explain.

If the score is perfectly predictable based on a person's background (e.g., "Everyone with income < $20k gets a score of 0.9"), the formula breaks. Why? Because the score isn't telling you anything new; it's just repeating what you already know. The "static" (random variation) is what makes the detective work possible.
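To make the recipe concrete, here is a minimal sketch in Python. It assumes the ratio can be computed by linearly stripping the covariates out of both the score and the outcome first; the paper's actual moment conditions may weight things differently, and all names here are ours, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def structural_effect(y, p, X):
    """Sketch of a moment-ratio estimator.

    Strip out whatever the covariates X already explain from both
    the outcome y and the score p, then divide how much the two
    residuals move together by how much the score residual moves
    on its own.
    """
    p_resid = p - LinearRegression().fit(X, p).predict(X)
    y_resid = y - LinearRegression().fit(X, y).predict(X)
    var_p = np.mean(p_resid ** 2)
    if var_p < 1e-12:
        # No "static" left: the score is (nearly) deterministic in X.
        # This is exactly the failure case described in Section 2.
        raise ValueError("Score has no variation beyond X; effect not identified.")
    return np.mean(p_resid * y_resid) / var_p
```

The guard clause is the whole story in miniature: the denominator is the "static", and when it vanishes the ratio is undefined.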

2. When the Trick Fails (The "Dead End")

The paper draws a very sharp line.

  • Success: If the probability score has any random wiggle room that isn't explained by other data, you can find the answer.
  • Failure: If the score is a robot that always gives the exact same number for a specific type of person, you are stuck.

The Analogy: The Broken Compass
Imagine trying to find North with a needle that never moves: it always points the same way no matter which direction you face. You can't tell whether it is genuinely tracking North or simply jammed. A needle with a little jitter, by contrast, at least proves it is responding to something.
In this paper, if the score is a "stuck compass" (a deterministic function of the other data), you cannot tell whether the hidden group has an effect or not. The math shows that in this broken state, you could invent a thousand different "truths" that all look exactly the same to an observer.
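A toy illustration of the dividing line, reusing the illustrative setup from the sketch above (stylized: these "scores" are not even clipped to [0, 1]):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
income = rng.normal(size=n)
X = income.reshape(-1, 1)

# Stuck compass: the score is an exact (linear) function of income.
p_stuck = 0.5 + 0.1 * income
# Staticky tuner: the same signal plus independent wiggle.
p_noisy = p_stuck + rng.normal(0.0, 0.05, size=n)

for name, p in [("stuck", p_stuck), ("noisy", p_noisy)]:
    resid = p - LinearRegression().fit(X, p).predict(X)
    print(f"{name} score, leftover variation: {np.mean(resid ** 2):.6f}")
# stuck -> ~0.000000  (the ratio's denominator collapses: dead end)
# noisy -> ~0.002500  (static survives: the formula has traction)
```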

3. The "Average" vs. The "Real" Effect

The paper also warns us about a common trap.

  • The Marginal Gap: This is the simple difference in health between "Group A" and "Group B" if you just look at the averages.
  • The Structural Effect: This is the true causal effect of the group status itself.

The Analogy: The Ice Cream Shop
Imagine you want to know if eating ice cream makes you taller.

  • Marginal Gap: You compare the average height of people who eat ice cream vs. those who don't. Maybe ice cream eaters are taller.
  • The Catch: Maybe ice cream eaters are also richer, and rich kids eat better food and grow taller. The "ice cream" isn't the cause; the "richness" is.

The paper shows that the formula recovers the Structural Effect (the pure ice cream effect), not the Marginal Gap. To get the Marginal Gap, you'd need to know exactly how the hidden groups are distributed across different backgrounds, which you don't have. But the Structural Effect is often the more useful number for policy anyway.
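In symbols, under a stylized additive model (our notation, not necessarily the paper's), with hidden group G, background X, and outcome Y:

```latex
% Stylized additive model: Y = f(X) + \beta G + \varepsilon
\text{Structural effect:}\quad
  \beta = \mathbb{E}[Y \mid X, G{=}1] - \mathbb{E}[Y \mid X, G{=}0]

\text{Marginal gap:}\quad
  \Delta = \mathbb{E}[Y \mid G{=}1] - \mathbb{E}[Y \mid G{=}0]
         = \beta + \underbrace{\mathbb{E}[f(X) \mid G{=}1] - \mathbb{E}[f(X) \mid G{=}0]}_{\text{composition (richness) term}}
```

The extra composition term is the "richness" channel from the analogy, and computing it requires knowing how G is spread across backgrounds, which is precisely what's missing.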

4. What Happens if the Algorithm is Wrong? (Robustness)

What if the computer's probability scores are slightly wrong? Say the algorithm is usually well calibrated, but its stated probabilities can be off by up to 5 percentage points.

The paper calculates exactly how much this error messes up your result.

  • The Good News: If the algorithm's errors are random (sometimes too high, sometimes too low), they cancel out, and your result is still pretty good.
  • The Bad News: If the algorithm is systematically wrong (e.g., it always overestimates the risk for people in a specific neighborhood), your result will be biased.
  • The Safety Net: The paper gives a "worst-case scenario" formula. It tells you: "Even if the algorithm is wrong by up to 5%, your answer won't be off by more than X amount." This helps researchers know how much they can trust their results; a crude numerical version of the same idea is sketched below.
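The paper's bound is analytic; as a purely numerical stand-in (emphatically not the paper's formula), one can probe the same question by re-running the estimator on deliberately miscalibrated scores, reusing the structural_effect sketch from Section 1:

```python
import numpy as np

def sensitivity_band(y, p, X, eps=0.05):
    """Crude numerical probe, NOT the paper's analytic bound:
    re-estimate under systematic multiplicative miscalibration
    (everyone's risk over- or under-stated by a factor 1 +/- eps)
    and report the range of answers that results."""
    estimates = []
    for shift in (-eps, 0.0, eps):
        p_tilde = np.clip(p * (1.0 + shift), 0.0, 1.0)
        estimates.append(structural_effect(y, p_tilde, X))
    return min(estimates), max(estimates)
```

One detail worth noting: a constant additive miscalibration (everyone's score shifted by the same amount) washes out entirely when the covariates are stripped away; it is the systematic, group-patterned errors that move the answer.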

5. Why Not Just Guess? (Hard Thresholds)

A common mistake researchers make is: "If the score is above 0.5, I'll call them 'Group A'. If below, 'Group B'." Then they compare the two groups.

The Analogy: The Blunt Knife
The paper says this is like trying to cut a diamond with a butter knife. It's too blunt.
By forcing a "Yes/No" decision on a fuzzy probability, you throw away all the nuance. The paper proves mathematically that this "Hard Threshold" method will always underestimate the true effect. It's like looking at a blurry photo and squinting; you'll see less detail than if you used the right lens (the formula).
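A quick simulation (an illustrative setup of ours, again reusing structural_effect from Section 1) makes the bluntness visible:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)          # observed background covariate
X = x.reshape(-1, 1)

# A calibrated-ish score: driven by x, plus genuine "static".
p = np.clip(0.5 + 0.15 * x + rng.normal(0.0, 0.1, size=n), 0.01, 0.99)
g = rng.binomial(1, p)          # hidden group (unseen in real applications)
beta = 2.0                      # true structural effect
y = x + beta * g + rng.normal(size=n)

# Using the scores directly recovers beta almost exactly (~2.0).
print("ratio formula :", structural_effect(y, p, X))
# Classifying at 0.5 first, then comparing the labeled groups
# (controlling for x), lands far below the truth.
print("hard threshold:", structural_effect(y, (p > 0.5).astype(float), X))
```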

Summary: The Takeaway

This paper is a toolkit for researchers who are missing data.

  1. You can solve the mystery: Even without seeing the hidden group, if you have a "calibrated" probability score, you can calculate the true effect.
  2. You need noise: The score must have some randomness; if it's too perfect, the math breaks.
  3. Don't guess: Don't just turn the scores into Yes/No lists; use the specific formula to get the real answer.
  4. Know your limits: If the scores are biased, the paper tells you exactly how much your answer might be off.

It turns a "missing data" problem into a solvable math puzzle, provided you have a good probability score to work with.
