Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse

This paper demonstrates that fully autonomous AI analysts can cheaply replicate the analytic diversity and conflicting conclusions observed in human many-analyst studies, revealing that empirical results are highly sensitive to analytic choices and prompting a new transparency norm requiring multiverse-style reporting and full prompt disclosure for AI-generated science.

Martin Bertran, Riccardo Fogliato, Zhiwei Steven Wu

Published 2026-03-12
📖 5 min read🧠 Deep dive

Imagine you have a single, giant jar of mixed-up LEGO bricks. You ask 100 different people to build a tower using those exact same bricks, following the same basic instruction: "Build a tower that is taller than it is wide."

You might expect them all to build something similar. But in reality, some people will build a red tower, some a blue one. Some will use a wide base, others a narrow one. Some will leave gaps, others will pack them tight. Because everyone makes tiny, reasonable choices along the way, you end up with 100 completely different towers. Some are stable, some are wobbly, and some might even look like they're falling over.

This is exactly what happens in science when different teams analyze the same data. This phenomenon is called the "Many-Analyst Problem."

The New Experiment: The AI Multiverse

This paper asks a scary but fascinating question: What happens if we replace the 100 humans with 5,000 AI robots?

The researchers created a "Multiverse" of AI analysts. They gave these AI robots the same dataset (like the LEGO jar) and the same question (like the tower instruction). But they gave the robots different "personalities" (prompts) and used different "brains" (different AI models).

Here is what they found, broken down simply:

1. The "Garden of Forking Paths"

In science, there isn't just one way to analyze data. There are thousands of "forking paths."

  • Path A: Should we remove the weird data points?
  • Path B: Should we use a straight line or a curved line to fit the data?
  • Path C: Should we count this group of people as one unit or two?

When humans do this, they make these choices based on their experience. When AI does this, it makes choices based on its programming and the "personality" you give it. The study found that AI analysts produced a massive explosion of different answers. Some said "Yes, the hypothesis is true!" and others said "No, it's false!" even though they were looking at the exact same numbers.

2. The "Personality" Effect

The researchers tested different "personalities" for the AI:

  • The Skeptic: "This idea is probably wrong. Try to prove it false."
  • The Cheerleader: "This idea is great! Find evidence that it's true."
  • The P-Hacker (The Villain): "I don't care how, but make the data look like it supports the hypothesis."

The result? The AI was easily steered.

  • The "Skeptic" AI rarely found support for the hypothesis.
  • The "P-Hacker" AI found support almost every time.
  • Even the "Cheerleader" AI was more likely to say "Yes" than the "Skeptic."

It's like asking a chef to cook a steak. If you tell the chef, "Make this steak rare," they will cook it rare. If you tell them, "Make this steak well-done," they will cook it well-done. The meat (the data) didn't change, but the instruction (the prompt) changed the result.

3. The Good News: The AI Auditor

The researchers were worried the AI would just make things up (hallucinate). So, they added a second AI, an "Auditor," to act like a strict teacher.

  • The Auditor checked every single AI analyst's work.
  • It threw out the ones that made up numbers or used bad math.
  • The Catch: Even after the Auditor threw out the "bad" AI reports, the remaining "good" reports still disagreed with each other! The "Multiverse" of answers was still huge.

Why Should We Care?

This paper highlights a double-edged sword for the future of science:

The Danger (The "Cherry-Picking" Problem):
In the past, if a human scientist wanted to find a specific result, they had to spend months trying different methods. It was hard and expensive.
Now, with AI, a bad actor (or even a well-meaning but biased researcher) could ask an AI to "run 1,000 analyses" and then just pick the one result that makes them look good. It becomes incredibly easy to "cherry-pick" a conclusion that fits a narrative, even if the data doesn't really support it.

The Opportunity (The "X-Ray Vision" Problem):
But, this same power can save science. Because AI can run thousands of analyses so cheaply, we can finally see the whole picture.
Instead of publishing just one result (which might be a fluke), we can publish the distribution of all results.

  • Old Way: "Our study proves X is true."
  • New Way: "We ran 5,000 analyses. In 60% of them, X was true. In 40%, it wasn't. Here is the full list of choices that changed the outcome."

This makes the "hidden uncertainty" visible. It forces scientists to admit, "Hey, our conclusion depends heavily on how we cleaned the data."

The Bottom Line

The paper concludes that we need a new rule for the AI age: Transparency.

Just as scientists must share their code and data, they must now share the exact prompts they used to talk to the AI. We need to know: "Did you ask the AI to be a skeptic or a cheerleader?"

If we don't do this, we risk a future where evidence is abundant, but truth is impossible to find because everyone is just picking the AI answer they like best. But if we embrace this "Multiverse" approach, we can turn scientific uncertainty from a hidden secret into a visible, measurable quantity.