Econometric Inference with Machine-Learned Proxies: Partial Identification via Data Combination

This paper proposes a framework for partial identification and inference in general moment models that use machine-learned proxies. It treats the proxies as linking variables between a downstream sample and an auxiliary validation sample, enabling valid asymptotic inference without restrictive assumptions on the upstream machine learning procedure and without resampling.

Original authors: Lixiong Li

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery about a hidden criminal (let's call him Z). You have a lot of clues, but you can't see Z directly. Instead, you have a very smart, high-tech robot (the Machine Learning model) that looks at a mountain of raw evidence (like blurry photos or messy text, called X) and gives you a "suspect profile" (called Ẑ, pronounced "Z-hat").

The problem? The robot isn't perfect. Sometimes it's right, sometimes it's wrong, and sometimes it gets confused by things that look like the criminal but aren't.

If you just blindly trust the robot's profile and plug it into your investigation, you might catch the wrong guy or miss the real one. This is the problem this paper solves.

Here is the paper's solution, broken down into simple concepts and analogies:

1. The Two Datasets: The "Training Camp" and the "Crime Scene"

The author suggests you need two different sets of information to solve this mystery:

  • The Crime Scene (Downstream Data): This is where you are trying to solve the main economic problem. You have the clues (X) and the robot's suspect profile (Ẑ), but you don't have the real criminal (Z).
  • The Training Camp (Validation Data): This is a separate dataset where you do have the real criminal (Z) and the robot's profile (Ẑ) side-by-side. You might not have all the other clues here, but you know exactly how often the robot is right or wrong.

The Analogy: Imagine you are trying to guess the weight of a mystery box.

  • Crime Scene: You have a fancy digital scale (the robot) that gives you a number, but you don't know if the scale is accurate.
  • Training Camp: You have a separate room where you weigh the same boxes on the fancy scale and on a perfect, heavy-duty industrial scale. You use this room to learn exactly how the fancy scale behaves.

2. The Big Idea: The "Bridge" Instead of a "Substitute"

Most researchers make a mistake: they treat the robot's guess (Ẑ) as if it were the real thing (Z). They say, "Okay, the robot says it's 50kg, so it is 50kg." This leads to errors.

This paper says: Don't treat the robot's guess as the answer. Treat it as a bridge.

Think of the robot's guess (Ẑ) as a bridge connecting the "Training Camp" to the "Crime Scene."

  • In the Training Camp, we know the relationship between the Bridge and the Real Criminal.
  • In the Crime Scene, we see the Bridge.
  • By walking across the bridge, we can carry the knowledge of "how the robot behaves" from the Training Camp to the Crime Scene, without ever needing to see the Real Criminal directly in the Crime Scene.
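The bridge idea can be sketched with a toy calculation. The Python snippet below (an illustration, not the paper's estimator, and all data here is hypothetical) tallies on validation data how often the robot's guess Ẑ matches the true Z, then carries those rates over to a downstream sample where only Ẑ is observed:

```python
from collections import Counter

# Validation sample ("Training Camp"): pairs of (true Z, proxy Zhat).
validation = [(1, 1), (1, 1), (0, 0), (0, 1), (1, 0), (0, 0), (1, 1), (0, 0)]

# Step 1: learn how the proxy relates to the truth,
# i.e. estimate P(Z = z | Zhat = zhat) from the validation sample.
counts = Counter(validation)
totals = Counter(zhat for _, zhat in validation)
p_z_given_zhat = {(z, zh): counts[(z, zh)] / totals[zh] for (z, zh) in counts}

# Step 2: the downstream sample ("Crime Scene") has only the proxy Zhat.
downstream_zhat = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]

# Step 3 (walk the bridge): estimate the share of Z = 1 downstream by
# reweighting the proxy guesses with the rates learned upstream.
est_share_z1 = sum(p_z_given_zhat.get((1, zh), 0.0)
                   for zh in downstream_zhat) / len(downstream_zhat)
print(round(est_share_z1, 3))  # -> 0.55
```

Note that this simple reweighting pins down a single number only because it assumes the robot behaves identically in both samples; the paper's partial-identification machinery is built precisely for the case where such an assumption only narrows the answer to a range.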

3. The "Partial Identification" Strategy: Drawing a Safe Zone

Instead of trying to find the exact weight of the mystery box (which might be impossible if the scale is unreliable), the paper asks: "What is the range of weights that is still possible?"

This is called Partial Identification.

  • If the robot is very accurate, the "Safe Zone" (the range of possible weights) is tiny.
  • If the robot is terrible, the "Safe Zone" is huge.
  • The Key Benefit: Even if the robot is terrible, the "Safe Zone" is still valid. You won't be wrong; you'll just be less precise. This is much better than being confidently wrong.
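The "Safe Zone" logic can be made concrete with a back-of-the-envelope bound. The sketch below is a simplified illustration (not the paper's optimal-transport bounds): it assumes Z is binary and that the validation sample tells us the robot's overall misclassification rate. Since each misclassified observation shifts the estimated share by at most one unit out of n, the true share must lie within that error rate of the proxy share:

```python
def identified_interval(share_zhat1, error_rate):
    """Bounds on the true share P(Z = 1) downstream, given the downstream
    share of proxy predictions Zhat = 1 and an overall misclassification
    rate P(Zhat != Z) <= error_rate learned from the validation sample."""
    lower = max(0.0, share_zhat1 - error_rate)
    upper = min(1.0, share_zhat1 + error_rate)
    return lower, upper

# Accurate robot: a narrow safe zone around the proxy share.
print(identified_interval(0.6, 0.1))
# Terrible robot: the safe zone widens to the whole [0, 1] interval --
# uninformative, but never wrong.
print(identified_interval(0.6, 0.7))  # -> (0.0, 1.0)
```

The design choice mirrors the paper's key benefit: a bad proxy costs you precision, never validity.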

4. The Mathematical Magic: "Optimal Transport"

To calculate this "Safe Zone," the paper uses a mathematical tool called Optimal Transport.

  • The Metaphor: Imagine you have a pile of dirt (the distribution of the robot's guesses in the Training Camp) and a hole to fill (the distribution of the real criminals). You want to move the dirt to fill the hole with the least amount of effort.
  • The paper uses a clever trick to solve this math problem without getting stuck in a computer nightmare. Instead of trying to match every single specific guess to a specific criminal (which is too hard), they look at the overall shape of the piles. This makes the math solvable on a regular computer.
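To see the "dirt-moving" idea in miniature: in one dimension, the least-effort way to move one sample onto another is simply to match the sorted values (a textbook fact for convex costs). The paper's actual program is a much richer optimal-transport problem over joint distributions, but this toy shows what "minimum effort" means:

```python
def ot_cost_1d(source, target):
    """Minimal cost of moving the empirical distribution `source` onto
    `target` (equal sample sizes, cost = |x - y| per unit). In one
    dimension the optimal coupling pairs the sorted samples, so no
    linear program is needed for this toy case."""
    assert len(source) == len(target)
    return sum(abs(x - y) for x, y in zip(sorted(source), sorted(target)))

# Hypothetical proxy guesses vs. true values in a validation sample.
guesses = [2.0, 5.0, 3.0, 8.0]
truth = [3.0, 4.0, 6.0, 9.0]
print(ot_cost_1d(guesses, truth))  # -> 4.0
```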

5. The "No-Resampling" Trick: The "Split-Test"

Usually, when statisticians want to be sure their results are real, they use a method called "bootstrapping," which is like re-running the experiment thousands of times on a computer to see if the result holds up. This is computationally expensive.

This paper instead uses a faster approach based on cross-fitting:

  • The Analogy: Imagine you have a deck of cards. You split the deck in half.
    • Group A uses the first half to figure out the rules of the game.
    • Group B uses the second half to test if those rules work.
    • Then, you swap them.
  • By doing this, the researchers can calculate a valid confidence interval directly from standard statistical tables, without running thousands of simulations. It's like getting a fast, reliable verdict without waiting for a long jury deliberation.
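Here is a minimal cross-fitting sketch (an illustration of the general technique, not the paper's procedure): split the data in two, fit a simple "nuisance" model — here just a fold mean — on one half, score it on the other half, swap the roles, then read a confidence interval off the normal table instead of bootstrapping:

```python
import math

def cross_fit_mse(y):
    """Cross-fitted estimate of the out-of-fold squared error of a toy
    nuisance model (the fold mean), with a normal-approximation 95%
    confidence interval in place of a bootstrap."""
    half = len(y) // 2
    folds = [y[:half], y[half:]]
    scores = []
    for k in (0, 1):
        fit_fold, eval_fold = folds[1 - k], folds[k]
        g = sum(fit_fold) / len(fit_fold)              # fit on the other half
        scores += [(yi - g) ** 2 for yi in eval_fold]  # score out-of-fold
    n = len(scores)
    theta = sum(scores) / n
    var = sum((s - theta) ** 2 for s in scores) / n
    se = math.sqrt(var / n)                            # plug-in standard error
    return theta, (theta - 1.96 * se, theta + 1.96 * se)

theta, ci = cross_fit_mse([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

Because every observation is scored by a model fit on the other half, the fit never "grades its own homework," which is what lets the standard normal critical value (1.96) deliver a valid interval without resampling.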

Why This Matters

  • For Economists: You can now use powerful, complex AI tools to measure things like "political bias in news" or "air pollution" without worrying that the AI is slightly off. You get a valid answer with a clear "margin of error."
  • For AI Developers: It changes how we judge AI. We shouldn't just ask, "How accurate is the AI?" We should ask, "Does the AI preserve enough information to help us solve the economic problem?"
  • For Everyone: It shows that even if our tools aren't perfect, we can still get trustworthy answers if we know how to combine our data correctly.

In a nutshell: This paper gives us a new, robust way to use AI in economics. It treats AI predictions not as perfect facts, but as a bridge to connect what we know with what we want to learn, ensuring we never draw a false conclusion, even when the AI is imperfect.
