Nonstandard Errors in AI Agents

This study demonstrates that state-of-the-art AI coding agents exhibit significant "nonstandard errors" (divergent results driven by divergent analytical choices) when reproducing empirical research, and that this dispersion can be drastically reduced by exposure to exemplar papers, though through imitation rather than genuine understanding.

Ruijiang Gao, Steven Chong Xiao

Published 2026-03-18

Imagine you hire 150 different detectives to solve the same mystery: "How has the quality of trading for a specific exchange-traded fund (SPY) changed over the last decade?"

You give every detective the exact same pile of evidence (the raw data) and the exact same question. You expect them to all come back with the same answer, right?

Surprisingly, they don't. Some say the market got much better. Some say it got worse. Some say nothing changed at all.

This paper, "Nonstandard Errors in AI Agents," is about what happens when we let Artificial Intelligence (AI) do the work of human researchers. The authors found that AI agents, just like humans, make different choices that lead to wildly different results. They call this "Nonstandard Errors" (NSE).

Here is the story of their experiment, explained simply.

1. The Experiment: 150 Digital Detectives

The researchers set up a massive test using 150 autonomous AI agents (specifically, versions of "Claude Code").

  • The Task: Analyze 10 years of stock market data for the SPY fund.
  • The Goal: Test six specific theories (hypotheses) about market trends, like "Did trading volume go up?" or "Did prices become more efficient?"
  • The Setup: Each agent worked alone. They read the data, wrote their own computer code, picked their own math formulas, and wrote a full research report. No humans touched the code or the data during the process.

2. The Problem: The "Garden of Forking Paths"

When humans do research, they have to make choices. For example, to measure "trading volume," do you count the number of shares traded, or the total dollar value of those shares?

  • If you count shares, the trend might look like it's going down.
  • If you count dollars, the trend might look like it's going up.

Both choices are valid, but they tell different stories. The researchers call this a "fork in the road."

The study found that AI agents hit these forks constantly.

  • The Result: The agents produced a huge range of answers. For the "trading volume" question, the difference between the "best" and "worst" estimate was massive.
  • The Twist: Unlike humans, who might argue about how to calculate something, the AI agents were actually very consistent about how to do the math (they all used similar regression models). The chaos came entirely from what they chose to measure in the first place.
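The fork can be reproduced with synthetic data. In this minimal sketch (every number is invented for illustration, not taken from the paper), share volume drifts down while the price drifts up, so the very same time-trend regression gives opposite signs depending on which measure of "volume" you feed it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 10 years (~2520 trading days) of daily data: shares traded
# drift down while the price drifts up, so dollar volume drifts up.
n = 2520
t = np.arange(n)
shares = 1e8 * np.exp(-0.0002 * t + rng.normal(0, 0.05, n))
price = 200 * np.exp(0.0003 * t + rng.normal(0, 0.01, n))
dollar = shares * price

def trend_slope(y, t):
    """OLS slope of log(y) on a linear time trend."""
    x = np.column_stack([np.ones_like(t, dtype=float), t])
    return np.linalg.lstsq(x, np.log(y), rcond=None)[0][1]

print(trend_slope(shares, t))  # negative slope: "volume fell"
print(trend_slope(dollar, t))  # positive slope: "volume rose"
```

Both regressions are textbook-correct; the contradiction lives entirely in the measurement choice made before the math starts.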

3. The "Personality" of the AI

Here is where it gets really interesting. The researchers used two different "families" of AI models (let's call them Sonnet and Opus).

  • Sonnet agents had a specific "style": They loved using one type of math formula (Autocorrelation) and preferred looking at daily data.
  • Opus agents had a different "style": They almost exclusively used a different formula (Variance Ratio) and liked monthly data.

It turns out, AI models aren't just blank slates. They have embedded biases based on how they were trained. If you ask a Sonnet agent to analyze a stock, it will likely give you a different answer than an Opus agent, not because one is "smarter," but because they have different "methodological personalities."
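The two "personality" statistics are standard market-efficiency diagnostics. Here is a rough sketch of each on synthetic returns (the agents' actual implementations are not given in the source, so treat this as an illustration of the concepts, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
returns = rng.normal(0, 0.01, 2520)  # hypothetical daily returns

def autocorr(r, lag=1):
    """Lag-1 autocorrelation of returns: near 0 for an efficient market."""
    r = r - r.mean()
    return (r[:-lag] @ r[lag:]) / (r @ r)

def variance_ratio(r, q=20):
    """Variance of q-day returns over q times the 1-day variance.
    Near 1 under a random walk; deviations suggest predictability."""
    n = len(r) // q * q
    rq = r[:n].reshape(-1, q).sum(axis=1)
    return rq.var(ddof=1) / (q * r[:n].var(ddof=1))

print(autocorr(returns), variance_ratio(returns))
```

Both statistics probe the same hypothesis ("are prices efficient?"), yet they are different estimators and can disagree in finite samples, which is exactly why a model family's habitual preference for one of them matters.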

4. The Feedback Test: Can AI Learn from Each Other?

The researchers tried to fix the disagreement using a three-step process, similar to how human scientists work:

  1. Stage 1: Everyone works alone. (Chaos ensues; huge differences).
  2. Stage 2: The agents read written critiques from other AI agents (Peer Review).
    • Result: Nothing changed. The agents read the feedback and made scattered changes, but the overall disagreement stayed essentially the same. It was like giving a detective a note saying "maybe check the window," only for them to go check the door instead.
  3. Stage 3: The agents were shown the top 5 best-rated reports from the group.
    • Result: Massive convergence. Suddenly, almost everyone switched to match the top reports. If the top reports used "Dollar Volume," 90% of the agents switched to "Dollar Volume."

The Catch: This convergence wasn't because the agents suddenly understood the math better. It was because they imitated the winners. If the top reports had chosen the "wrong" measure, the agents would have all converged on the wrong answer together.
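One simple way to quantify this kind of convergence is the fraction of agents picking the most common specification, before and after seeing the exemplars. The stage counts below are invented to match the shape of the story, not the paper's data:

```python
from collections import Counter

# Hypothetical choices of "volume" measure across 150 agents.
stage1 = ["shares"] * 60 + ["dollars"] * 55 + ["trades"] * 35
stage3 = ["dollars"] * 135 + ["shares"] * 10 + ["trades"] * 5

def modal_share(choices):
    """Fraction of agents on the most common choice: a crude
    convergence measure (1.0 = everyone agrees)."""
    (_, count), = Counter(choices).most_common(1)
    return count / len(choices)

print(modal_share(stage1))  # 0.4
print(modal_share(stage3))  # 0.9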

5. The Big Takeaway: Why This Matters

This paper warns us about a future where AI writes our economic reports and policy evaluations.

  • Don't trust a single AI answer: If you ask one AI to analyze a problem, the answer you get depends entirely on which "fork in the road" it happened to take. It's like asking one person to guess the weather; you need a forecast from many sources to get the truth.
  • AI Peer Review is weak: Just having AI critique other AI doesn't fix the problem.
  • AI Imitation is dangerous: When AI sees a "good" example, it copies it blindly. It doesn't reason about why that example was good.
  • The "Lower Bound" Theory: The authors suggest that if AI agents (who share the same training data and logic) can't agree on an answer, then human researchers definitely won't either. The disagreement isn't a bug in the AI; it's a feature of the research question itself. The question was too vague to have a single right answer.

The Metaphor: The "Blind Men and the Elephant"

Imagine a group of blind men touching an elephant.

  • One touches the leg and says, "It's a tree."
  • One touches the ear and says, "It's a fan."
  • One touches the trunk and says, "It's a snake."

If you ask an AI to be the blind men, it will do the same thing. It will touch the "leg" (Dollar Volume) and say "Tree," while another touches the "ear" (Share Volume) and says "Fan."

The paper tells us: We cannot just ask the AI for "The Answer." Instead, we must ask the AI to run a "Multiverse Analysis"—letting 100 different AIs try 100 different ways to solve the problem, and then looking at the whole picture to understand the uncertainty.
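A multiverse analysis can be sketched as a grid over the forks (what to measure, times how often to sample it), reporting every estimate rather than a single answer. The dataset and the fork menu here are invented for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical shared dataset: share volume drifts down while
# the price drifts up over ~10 trading years.
t = np.arange(2520)
shares = 1e8 * np.exp(-0.0002 * t + rng.normal(0, 0.05, 2520))
price = 200 * np.exp(0.0003 * t + rng.normal(0, 0.01, 2520))

# The forks: what to measure x sampling frequency (in trading days).
measures = {"share_volume": shares, "dollar_volume": shares * price}
frequencies = {"daily": 1, "weekly": 5, "monthly": 21}

def trend(y, t):
    """OLS slope of log(y) on a linear time trend."""
    x = np.column_stack([np.ones_like(t, dtype=float), t])
    return np.linalg.lstsq(x, np.log(y), rcond=None)[0][1]

# Run every specification and keep the whole distribution.
results = {
    (m, f): trend(y[::step], t[::step])
    for (m, y), (f, step) in itertools.product(
        measures.items(), frequencies.items()
    )
}

for spec, slope in sorted(results.items()):
    print(spec, f"{slope:+.2e}")
```

The deliverable is the whole table of slopes, not any one cell: the spread across specifications is itself the honest answer to an underspecified question.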

In short: AI is a powerful tool, but it is not an oracle. It carries the same ambiguities and biases as the human research it was trained on. To get the truth, we need to look at the whole forest, not just one tree.
