Nonparametric Identification and Estimation of Causal Effects on Latent Outcomes

This paper proposes a general nonparametric framework that uses bridge functions and a debiasing procedure to identify and estimate average treatment effects on latent outcomes in randomized experiments. The approach addresses cross-study and within-study measurement noncomparability, a problem that causes standard methods to yield spurious results.

Jiawei Fu, Donald P. Green

Published 2026-04-13

Imagine you are a detective trying to solve a mystery: Did a specific intervention (like a new teaching method or a political campaign) actually change people's minds?

In the world of science, the "mind" or "attitude" you are trying to measure is often invisible. You can't see "political trust," "cognitive ability," or "social capital" directly. You can only see the shadows they cast: survey answers, test scores, or voting records.

This paper, written by Jiawei Fu and Donald Green, tackles a massive problem: How do we compare invisible things when we measure them with different rulers?

Here is the breakdown of their solution, using some everyday analogies.

The Problem: The "Apples vs. Oranges" Trap

Imagine two researchers are studying the same thing: How much people love pizza.

  • Researcher A asks people to rate their love on a scale of 1 to 10.
  • Researcher B asks people to rate it on a scale of "Meh" to "Heavenly."

If Researcher A finds that a new pizza sauce increased love by "2 points," and Researcher B finds it increased love by "1 Heavenly point," you cannot compare them. Is 2 points on the 1-10 scale the same as 1 Heavenly point? We don't know.

In the real world, this happens all the time. One study might measure "democracy" using voting turnout, while another uses freedom of speech scores. Even if the actual change in democracy is the same, the numbers look different because the "rulers" are different.

The authors call this the Noncomparability Challenge. It's like trying to compare the height of a building measured in "feet" against one measured in "stacks of pancakes." Without a conversion, the data is useless for comparing studies.

The Old Way: The "Smoothie" Mistake

Previously, scientists tried to fix this by making a "smoothie." They would take all their different measurements (voting, speech, protests) and blend them together using a computer algorithm (like Principal Component Analysis) to create one single "Democracy Smoothie."

The problem? If Researcher A uses a blender and Researcher B uses a food processor, the smoothies taste different. Even if they used the same fruit (the same underlying reality), the final drink isn't comparable. This leads to fake differences in results, making it look like an intervention worked in one place but failed in another, when really, they just used different tools.
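To see the "smoothie" problem in numbers, here is a toy simulation (my own illustration, not the paper's estimator). Both studies observe the same latent shift of 0.5, but because each study blends its indicators into an index with a different implicit scale, their measured "effects" disagree. All variable names and coefficients here are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
treat = rng.integers(0, 2, n)                 # randomized treatment
latent = rng.normal(0, 1, n) + 0.5 * treat    # true latent effect = 0.5

# Study A's "smoothie": average of two indicators with large loadings
a1 = 2.0 * latent + rng.normal(0, 1, n)
a2 = 1.5 * latent + rng.normal(0, 1, n)
index_a = (a1 + a2) / 2

# Study B's "smoothie": one indicator on a much smaller scale
index_b = 0.4 * latent + rng.normal(0, 0.2, n)

# Difference-in-means "effects" on each index
effect_a = index_a[treat == 1].mean() - index_a[treat == 0].mean()
effect_b = index_b[treat == 1].mean() - index_b[treat == 0].mean()
# effect_a is roughly four times effect_b, even though the
# underlying change in the latent trait is identical in both studies
```

The gap between `effect_a` and `effect_b` is pure measurement artifact: same fruit, different blenders.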

The New Solution: The "Universal Translator"

The authors propose a new method called Nonparametric Scaled Index (NSI). Think of this as building a Universal Translator for invisible concepts.

Here is how it works in three simple steps:

1. Pick a "Benchmark" (The Anchor)

In every study, you must pick one measurement to be the "Gold Standard" or the Anchor. Let's say in our pizza study, we decide that the "1 to 10 scale" is our Anchor.

2. Build a "Bridge" (The Translator)

For every other measurement (like the "Meh to Heavenly" scale), we need to build a Bridge Function.

  • Imagine the "Meh to Heavenly" scale is a foreign language.
  • The Bridge Function is a translator that says: "Okay, when someone says 'Heavenly,' that actually means '9' on our 1-to-10 scale."
  • Crucially, this translator doesn't need to be a simple math formula (like y = 2x). It can be a complex, wiggly, non-linear relationship. The computer learns the shape of the bridge.
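The bridge idea can be sketched in a few lines. This is a deliberately simplified illustration in which both scales are observed for the same respondents and the bridge is estimated as a binned conditional mean (a crude stand-in for the kernel or spline regression a real application would use); the paper's actual bridge functions are identified more carefully. Every name and number below is an assumption made for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
latent = rng.normal(0, 1, n)                          # invisible "pizza love"

anchor = 5.5 + 1.5 * latent + rng.normal(0, 0.3, n)   # the "1 to 10" ruler
# A second, nonlinearly related ruler (the "Meh to Heavenly" scale)
secondary = np.tanh(latent) + rng.normal(0, 0.1, n)

# Nonparametric bridge: mean anchor reading within quantile bins
# of the secondary reading -- the computer "learns the shape"
edges = np.quantile(secondary, np.linspace(0, 1, 21))
bin_of = lambda s: np.clip(np.digitize(s, edges[1:-1]), 0, 19)
bridge = np.array([anchor[bin_of(secondary) == b].mean() for b in range(20)])

def translate(s):
    """Map a secondary-scale reading onto the anchor scale."""
    return bridge[bin_of(np.asarray(s))]
```

After fitting, `translate` converts any "Meh to Heavenly" reading into approximate "1 to 10" units, and the learned curve is free to be as wiggly as the data demand.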

3. Cross-Study Comparison

Now, if a second study comes along with a different set of measurements (maybe they used a "Spicy to Mild" scale), as long as they also have the "1 to 10 scale" as their Anchor, we can translate their "Spicy to Mild" scale into the "1 to 10" scale too.

Suddenly, both studies are speaking the same language. We can finally compare them fairly.

How Do We Build the Bridge Without Seeing the Invisible?

You might ask: "If we can't see the 'real' pizza love, how do we know the translator is right?"

The authors use a clever trick involving Randomized Experiments.

  • Because the experiment randomly assigns people to get the treatment (new sauce) or not, the treatment acts like a flashlight.
  • The treatment changes the "real" pizza love.
  • By watching how the different measurements (the 1-10 scale and the Heavenly scale) react to the same flashlight (the treatment), the computer can figure out how they relate to each other.
  • It's like seeing how two different thermometers react when you put them in the same hot water. Even if one reads in Celsius and one in Fahrenheit, you can figure out the conversion rule just by watching them both heat up together.
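The thermometer analogy can be run as code. This is a linear toy version of the identification idea (the paper allows fully nonlinear bridges): randomization shifts the latent temperature, both instruments move, and the two group means give us two points on the conversion line. The setup below is my own illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3000
treat = rng.integers(0, 2, n)                        # the "flashlight"
temp = 20 + 15 * treat + rng.normal(0, 2, n)         # latent temperature

read_c = temp + rng.normal(0, 0.5, n)                # Celsius thermometer
read_f = 1.8 * temp + 32 + rng.normal(0, 0.5, n)     # Fahrenheit thermometer

# Two group means per instrument = two points on the conversion line
c0, c1 = read_c[treat == 0].mean(), read_c[treat == 1].mean()
f0, f1 = read_f[treat == 0].mean(), read_f[treat == 1].mean()

slope = (f1 - f0) / (c1 - c0)        # recovers ~1.8
intercept = f0 - slope * c0          # recovers ~32
```

We never observed `temp` directly, yet watching both readings respond to the same randomized shift recovers the F = 1.8C + 32 conversion rule.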

Why This Matters

  1. No More Fake Differences: It stops scientists from thinking an intervention failed just because they used a different survey question.
  2. Flexible: It doesn't force the data into a straight line (linear model). It allows for complex, real-world relationships.
  3. Better Design: It tells researchers: "Hey, if you want your study to be comparable to others, you MUST include at least one common question (the Anchor) that everyone else uses."

The Bottom Line

This paper gives scientists a new toolkit to measure the unmeasurable. Instead of blindly blending data into a smoothie and hoping for the best, they now have a Universal Translator. This ensures that when we compare studies across the world, we are actually comparing apples to apples, not apples to pancakes.
