Nonstandard Errors in AI Agents

This study demonstrates that state-of-the-art AI coding agents exhibit significant "nonstandard errors" (divergent results driven by divergent analytical choices) when reproducing empirical research, and that this dispersion can be drastically reduced by exposure to exemplar papers, though through imitation rather than genuine understanding.

Ruijiang Gao, Steven Chong Xiao

Published 2026-03-18

Imagine you hire 150 different detectives to solve the same mystery: "How has the quality of trading for a specific exchange-traded fund (SPY) changed over the last decade?"

You give every detective the exact same pile of evidence (the raw data) and the exact same question. You expect them to all come back with the same answer, right?

Surprisingly, they don't. Some say the market got much better. Some say it got worse. Some say nothing changed at all.

This paper, "Nonstandard Errors in AI Agents," is about what happens when we let Artificial Intelligence (AI) do the work of human researchers. The authors found that AI agents, just like humans, make different choices that lead to wildly different results. They call this "Nonstandard Errors" (NSE).

Here is the story of their experiment, explained simply.

1. The Experiment: 150 Digital Detectives

The researchers set up a massive test using 150 autonomous AI agents (specifically, versions of "Claude Code").

  • The Task: Analyze 10 years of stock market data for the SPY fund.
  • The Goal: Test six specific theories (hypotheses) about market trends, like "Did trading volume go up?" or "Did prices become more efficient?"
  • The Setup: Each agent worked alone. They read the data, wrote their own computer code, picked their own math formulas, and wrote a full research report. No humans touched the code or the data during the process.

2. The Problem: The "Garden of Forking Paths"

When humans do research, they have to make choices. For example, to measure "trading volume," do you count the number of shares traded, or the total dollar value of those shares?

  • If you count shares, the trend might look like it's going down.
  • If you count dollars, the trend might look like it's going up.

Both choices are valid, but they tell different stories. The researchers call this a "fork in the road."

The study found that AI agents hit these forks constantly.

  • The Result: The agents produced a huge range of answers. For the "trading volume" question, the difference between the "best" and "worst" estimate was massive.
  • The Twist: Unlike humans, who might argue about how to calculate something, the AI agents were actually very consistent about how to do the math (they all used similar regression models). The chaos came entirely from what they chose to measure in the first place.
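The fork can be reproduced with synthetic data. In this minimal sketch (every number is invented for illustration, not taken from the paper), share volume drifts down while the price drifts up, so the very same time-trend regression gives opposite signs depending on which measure of "volume" you feed it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 10 years (~2520 trading days) of daily data: shares traded
# drift down while the price drifts up, so dollar volume drifts up.
n = 2520
t = np.arange(n)
shares = 1e8 * np.exp(-0.0002 * t + rng.normal(0, 0.05, n))
price = 200 * np.exp(0.0003 * t + rng.normal(0, 0.01, n))
dollar = shares * price

def trend_slope(y, t):
    """OLS slope of log(y) on a linear time trend."""
    x = np.column_stack([np.ones_like(t, dtype=float), t])
    return np.linalg.lstsq(x, np.log(y), rcond=None)[0][1]

print(trend_slope(shares, t))  # negative slope: "volume fell"
print(trend_slope(dollar, t))  # positive slope: "volume rose"
```

Both regressions are textbook-correct; the contradiction lives entirely in the measurement choice made before the math starts.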

3. The "Personality" of the AI

Here is where it gets really interesting. The researchers used two different "families" of AI models (let's call them Sonnet and Opus).

  • Sonnet agents had a specific "style": They loved using one type of math formula (Autocorrelation) and preferred looking at daily data.
  • Opus agents had a different "style": They almost exclusively used a different formula (Variance Ratio) and liked monthly data.

It turns out, AI models aren't just blank slates. They have embedded biases based on how they were trained. If you ask a Sonnet agent to analyze a stock, it will likely give you a different answer than an Opus agent, not because one is "smarter," but because they have different "methodological personalities."
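The two "personality" statistics are standard market-efficiency diagnostics. Here is a rough sketch of each on synthetic returns (the agents' actual implementations are not given in the source, so treat this as an illustration of the concepts, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
returns = rng.normal(0, 0.01, 2520)  # hypothetical daily returns

def autocorr(r, lag=1):
    """Lag-1 autocorrelation of returns: near 0 for an efficient market."""
    r = r - r.mean()
    return (r[:-lag] @ r[lag:]) / (r @ r)

def variance_ratio(r, q=20):
    """Variance of q-day returns over q times the 1-day variance.
    Near 1 under a random walk; deviations suggest predictability."""
    n = len(r) // q * q
    rq = r[:n].reshape(-1, q).sum(axis=1)
    return rq.var(ddof=1) / (q * r[:n].var(ddof=1))

print(autocorr(returns), variance_ratio(returns))
```

Both statistics probe the same hypothesis ("are prices efficient?"), yet they are different estimators and can disagree in finite samples, which is exactly why a model family's habitual preference for one of them matters.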

4. The Feedback Test: Can AI Learn from Each Other?

The researchers tried to fix the disagreement using a three-step process, similar to how human scientists work:

  1. Stage 1: Everyone works alone. (Chaos ensues; huge differences).
  2. Stage 2: The agents read written critiques from other AI agents (Peer Review).
    • Result: Nothing changed. The agents read the feedback and made scattered changes, but the overall disagreement stayed essentially the same. It was like giving a detective a note saying "maybe check the window," only for them to go check the door instead.
  3. Stage 3: The agents were shown the top 5 best-rated reports from the group.
    • Result: Massive convergence. Suddenly, almost everyone switched to match the top reports. If the top reports used "Dollar Volume," 90% of the agents switched to "Dollar Volume."

The Catch: This convergence wasn't because the agents suddenly understood the math better. It was because they imitated the winners. If the top reports had chosen the "wrong" measure, the agents would have all converged on the wrong answer together.
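One simple way to quantify this kind of convergence is the fraction of agents picking the most common specification, before and after seeing the exemplars. The stage counts below are invented to match the shape of the story, not the paper's data:

```python
from collections import Counter

# Hypothetical choices of "volume" measure across 150 agents.
stage1 = ["shares"] * 60 + ["dollars"] * 55 + ["trades"] * 35
stage3 = ["dollars"] * 135 + ["shares"] * 10 + ["trades"] * 5

def modal_share(choices):
    """Fraction of agents on the most common choice: a crude
    convergence measure (1.0 = everyone agrees)."""
    (_, count), = Counter(choices).most_common(1)
    return count / len(choices)

print(modal_share(stage1))  # 0.4
print(modal_share(stage3))  # 0.9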

5. The Big Takeaway: Why This Matters

This paper warns us about a future where AI writes our economic reports and policy evaluations.

  • Don't trust a single AI answer: If you ask one AI to analyze a problem, the answer you get depends entirely on which "fork in the road" it happened to take. It's like asking one person to guess the weather; you need a forecast from many sources to get the truth.
  • AI Peer Review is weak: Just having AI critique other AI doesn't fix the problem.
  • AI Imitation is dangerous: When AI sees a "good" example, it copies it blindly. It doesn't reason about why that example was good.
  • The "Lower Bound" Theory: The authors suggest that if AI agents (who share the same training data and logic) can't agree on an answer, then human researchers definitely won't either. The disagreement isn't a bug in the AI; it's a feature of the research question itself. The question was too vague to have a single right answer.

The Metaphor: The "Blind Men and the Elephant"

Imagine a group of blind men touching an elephant.

  • One touches the leg and says, "It's a tree."
  • One touches the ear and says, "It's a fan."
  • One touches the trunk and says, "It's a snake."

If you ask an AI to be the blind men, it will do the same thing. It will touch the "leg" (Dollar Volume) and say "Tree," while another touches the "ear" (Share Volume) and says "Fan."

The paper tells us: We cannot just ask the AI for "The Answer." Instead, we must ask the AI to run a "Multiverse Analysis"—letting 100 different AIs try 100 different ways to solve the problem, and then looking at the whole picture to understand the uncertainty.
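A multiverse analysis can be sketched as a grid over the forks (what to measure, times how often to sample it), reporting every estimate rather than a single answer. The dataset and the fork menu here are invented for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical shared dataset: share volume drifts down while
# the price drifts up over ~10 trading years.
t = np.arange(2520)
shares = 1e8 * np.exp(-0.0002 * t + rng.normal(0, 0.05, 2520))
price = 200 * np.exp(0.0003 * t + rng.normal(0, 0.01, 2520))

# The forks: what to measure x sampling frequency (in trading days).
measures = {"share_volume": shares, "dollar_volume": shares * price}
frequencies = {"daily": 1, "weekly": 5, "monthly": 21}

def trend(y, t):
    """OLS slope of log(y) on a linear time trend."""
    x = np.column_stack([np.ones_like(t, dtype=float), t])
    return np.linalg.lstsq(x, np.log(y), rcond=None)[0][1]

# Run every specification and keep the whole distribution.
results = {
    (m, f): trend(y[::step], t[::step])
    for (m, y), (f, step) in itertools.product(
        measures.items(), frequencies.items()
    )
}

for spec, slope in sorted(results.items()):
    print(spec, f"{slope:+.2e}")
```

The deliverable is the whole table of slopes, not any one cell: the spread across specifications is itself the honest answer to an underspecified question.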

In short: AI is a powerful tool, but it is not an oracle. It carries the same ambiguities and biases as the human research it was trained on. To get the truth, we need to look at the whole forest, not just one tree.
