Econometric Inference with Machine-Learned Proxies: Partial Identification via Data Combination

This paper proposes a framework for partial identification and inference in general moment models that use machine-learned proxies. It treats the proxies as linking variables between a downstream sample and an auxiliary validation sample, enabling valid asymptotic inference without restrictive assumptions on the upstream machine learning procedure and without resampling.

Original authors: Lixiong Li

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery about a hidden criminal (let's call him Z). You have a lot of clues, but you can't see Z directly. Instead, you have a very smart, high-tech robot (the Machine Learning model) that looks at a mountain of raw evidence (like blurry photos or messy text, called X) and gives you a "suspect profile" (called Ẑ, pronounced "Z-hat").

The problem? The robot isn't perfect. Sometimes it's right, sometimes it's wrong, and sometimes it gets confused by things that look like the criminal but aren't.

If you just blindly trust the robot's profile and plug it into your investigation, you might catch the wrong guy or miss the real one. This is the problem this paper solves.

Here is the paper's solution, broken down into simple concepts and analogies:

1. The Two Datasets: The "Training Camp" and the "Crime Scene"

The author suggests you need two different sets of information to solve this mystery:

  • The Crime Scene (Downstream Data): This is where you are trying to solve the main economic problem. You have the clues (X) and the robot's suspect profile (Ẑ), but you don't have the real criminal (Z).
  • The Training Camp (Validation Data): This is a separate dataset where you do have the real criminal (Z) and the robot's profile (Ẑ) side-by-side. You might not have all the other clues here, but you know exactly how often the robot is right or wrong.

The Analogy: Imagine you are trying to guess the weight of a mystery box.

  • Crime Scene: You have a fancy digital scale (the robot) that gives you a number, but you don't know if the scale is accurate.
  • Training Camp: You have a separate room where you weigh the same boxes on the fancy scale and on a perfect, heavy-duty industrial scale. You use this room to learn exactly how the fancy scale behaves.

2. The Big Idea: The "Bridge" Instead of a "Substitute"

Most researchers make a mistake: they treat the robot's guess (Ẑ) as if it were the real thing (Z). They say, "Okay, the robot says it's 50kg, so it is 50kg." This leads to errors.

This paper says: Don't treat the robot's guess as the answer. Treat it as a bridge.

Think of the robot's guess (Ẑ) as a bridge connecting the "Training Camp" to the "Crime Scene."

  • In the Training Camp, we know the relationship between the Bridge and the Real Criminal.
  • In the Crime Scene, we see the Bridge.
  • By walking across the bridge, we can carry the knowledge of "how the robot behaves" from the Training Camp to the Crime Scene, without ever needing to see the Real Criminal directly in the Crime Scene.
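The bridge idea can be sketched with a toy calculation. The Python snippet below (an illustration, not the paper's estimator, and all data here is hypothetical) tallies on validation data how often the robot's guess Ẑ matches the true Z, then carries those rates over to a downstream sample where only Ẑ is observed:

```python
from collections import Counter

# Validation sample ("Training Camp"): pairs of (true Z, proxy Zhat).
validation = [(1, 1), (1, 1), (0, 0), (0, 1), (1, 0), (0, 0), (1, 1), (0, 0)]

# Step 1: learn how the proxy relates to the truth,
# i.e. estimate P(Z = z | Zhat = zhat) from the validation sample.
counts = Counter(validation)
totals = Counter(zhat for _, zhat in validation)
p_z_given_zhat = {(z, zh): counts[(z, zh)] / totals[zh] for (z, zh) in counts}

# Step 2: the downstream sample ("Crime Scene") has only the proxy Zhat.
downstream_zhat = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]

# Step 3 (walk the bridge): estimate the share of Z = 1 downstream by
# reweighting the proxy guesses with the rates learned upstream.
est_share_z1 = sum(p_z_given_zhat.get((1, zh), 0.0)
                   for zh in downstream_zhat) / len(downstream_zhat)
print(round(est_share_z1, 3))  # -> 0.55
```

Note that this simple reweighting pins down a single number only because it assumes the robot behaves identically in both samples; the paper's partial-identification machinery is built precisely for the case where such an assumption only narrows the answer to a range.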

3. The "Partial Identification" Strategy: Drawing a Safe Zone

Instead of trying to find the exact weight of the mystery box (which might be impossible if the scale is unreliable), the paper asks: "What is the range of weights that is still possible?"

This is called Partial Identification.

  • If the robot is very accurate, the "Safe Zone" (the range of possible weights) is tiny.
  • If the robot is terrible, the "Safe Zone" is huge.
  • The Key Benefit: Even if the robot is terrible, the "Safe Zone" is still valid. You won't be wrong; you'll just be less precise. This is much better than being confidently wrong.
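The "Safe Zone" logic can be made concrete with a back-of-the-envelope bound. The sketch below is a simplified illustration (not the paper's optimal-transport bounds): it assumes Z is binary and that the validation sample tells us the robot's overall misclassification rate. Since each misclassified observation shifts the estimated share by at most one unit out of n, the true share must lie within that error rate of the proxy share:

```python
def identified_interval(share_zhat1, error_rate):
    """Bounds on the true share P(Z = 1) downstream, given the downstream
    share of proxy predictions Zhat = 1 and an overall misclassification
    rate P(Zhat != Z) <= error_rate learned from the validation sample."""
    lower = max(0.0, share_zhat1 - error_rate)
    upper = min(1.0, share_zhat1 + error_rate)
    return lower, upper

# Accurate robot: a narrow safe zone around the proxy share.
print(identified_interval(0.6, 0.1))
# Terrible robot: the safe zone widens to the whole [0, 1] interval --
# uninformative, but never wrong.
print(identified_interval(0.6, 0.7))  # -> (0.0, 1.0)
```

The design choice mirrors the paper's key benefit: a bad proxy costs you precision, never validity.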

4. The Mathematical Magic: "Optimal Transport"

To calculate this "Safe Zone," the paper uses a mathematical tool called Optimal Transport.

  • The Metaphor: Imagine you have a pile of dirt (the distribution of the robot's guesses in the Training Camp) and a hole to fill (the distribution of the real criminals). You want to move the dirt to fill the hole with the least amount of effort.
  • The paper uses a clever trick to solve this math problem without getting stuck in a computer nightmare. Instead of trying to match every single specific guess to a specific criminal (which is too hard), they look at the overall shape of the piles. This makes the math solvable on a regular computer.
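To see the "dirt-moving" idea in miniature: in one dimension, the least-effort way to move one sample onto another is simply to match the sorted values (a textbook fact for convex costs). The paper's actual program is a much richer optimal-transport problem over joint distributions, but this toy shows what "minimum effort" means:

```python
def ot_cost_1d(source, target):
    """Minimal cost of moving the empirical distribution `source` onto
    `target` (equal sample sizes, cost = |x - y| per unit). In one
    dimension the optimal coupling pairs the sorted samples, so no
    linear program is needed for this toy case."""
    assert len(source) == len(target)
    return sum(abs(x - y) for x, y in zip(sorted(source), sorted(target)))

# Hypothetical proxy guesses vs. true values in a validation sample.
guesses = [2.0, 5.0, 3.0, 8.0]
truth = [3.0, 4.0, 6.0, 9.0]
print(ot_cost_1d(guesses, truth))  # -> 4.0
```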

5. The "No-Resampling" Trick: The "Split-Test"

Usually, when statisticians want to be sure their results are real, they use a method called "bootstrapping," which is like re-running the experiment thousands of times on a computer to see if the result holds up. This is computationally expensive.

This paper instead uses a faster approach based on cross-fitting:

  • The Analogy: Imagine you have a deck of cards. You split the deck in half.
    • Group A uses the first half to figure out the rules of the game.
    • Group B uses the second half to test if those rules work.
    • Then, you swap them.
  • By doing this, the researchers can calculate a valid confidence interval directly from standard statistical tables, without running thousands of simulations. It's like getting a fast, reliable verdict without waiting for a long jury deliberation.
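Here is a minimal cross-fitting sketch (an illustration of the general technique, not the paper's procedure): split the data in two, fit a simple "nuisance" model — here just a fold mean — on one half, score it on the other half, swap the roles, then read a confidence interval off the normal table instead of bootstrapping:

```python
import math

def cross_fit_mse(y):
    """Cross-fitted estimate of the out-of-fold squared error of a toy
    nuisance model (the fold mean), with a normal-approximation 95%
    confidence interval in place of a bootstrap."""
    half = len(y) // 2
    folds = [y[:half], y[half:]]
    scores = []
    for k in (0, 1):
        fit_fold, eval_fold = folds[1 - k], folds[k]
        g = sum(fit_fold) / len(fit_fold)              # fit on the other half
        scores += [(yi - g) ** 2 for yi in eval_fold]  # score out-of-fold
    n = len(scores)
    theta = sum(scores) / n
    var = sum((s - theta) ** 2 for s in scores) / n
    se = math.sqrt(var / n)                            # plug-in standard error
    return theta, (theta - 1.96 * se, theta + 1.96 * se)

theta, ci = cross_fit_mse([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

Because every observation is scored by a model fit on the other half, the fit never "grades its own homework," which is what lets the standard normal critical value (1.96) deliver a valid interval without resampling.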

Why This Matters

  • For Economists: You can now use powerful, complex AI tools to measure things like "political bias in news" or "air pollution" without worrying that the AI is slightly off. You get a valid answer with a clear "margin of error."
  • For AI Developers: It changes how we judge AI. We shouldn't just ask, "How accurate is the AI?" We should ask, "Does the AI preserve enough information to help us solve the economic problem?"
  • For Everyone: It shows that even if our tools aren't perfect, we can still get trustworthy answers if we know how to combine our data correctly.

In a nutshell: This paper gives us a new, robust way to use AI in economics. It treats AI predictions not as perfect facts, but as a bridge to connect what we know with what we want to learn, ensuring we never draw a false conclusion, even when the AI is imperfect.
