Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

This paper introduces the MADQA benchmark and a novel accuracy-effort evaluation protocol. It shows that while multimodal agents can match human accuracy on document-based tasks, they rely on inefficient brute-force search rather than genuine strategic reasoning, and still fall well short of oracle-level performance.

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta

Published 2026-03-13

The Great Document Detective Contest: Strategy vs. Guessing

Imagine you are hired to solve a mystery. You are handed a massive, dusty library containing 800 different books, reports, and manuals. Some are thin pamphlets; others are 800-page legal contracts. Your job is to answer specific questions like, "Which restaurant has a lower instructor-to-student ratio: the Firearms range or the New Mexico Justice System?"

To do this, you can't just guess. You have to find the right pages, read the fine print, look at the charts, and piece together clues from different books to get the answer.

This is exactly what the new paper, MADQA, is all about. It's a giant test designed to see if modern AI "agents" (smart computer programs) are actually strategic detectives or just lucky guessers throwing darts at a wall.

Here is the breakdown of the paper in simple terms:

1. The Problem: Are They Smart or Just Loud?

For a long time, we've asked AI to read documents. But most tests were too easy. They were like asking, "What is the capital of France?" while holding a map of France. The AI just memorized the map.

The researchers wanted to know: When the AI has to hunt for information in a messy, complex library, does it have a plan?

  • The Ideal: A human detective who knows to check the "Financials" section first, then cross-reference it with the "HR" section.
  • The Reality: Many AIs act like a frantic squirrel. They search, fail, search again, search harder, and eventually stumble on the answer by sheer volume of effort, not by being smart.

2. The Solution: The "MADQA" Library

The team built a new, super-hard test called MADQA (Multimodal Agentic Document QA).

  • The Collection: 800 real-world PDFs (tax forms, menus, legal filings, technical manuals).
  • The Questions: 2,250 questions written by humans. These aren't simple "find the word" tasks. They require Multi-Hop Reasoning.
    • Example: "Find the budget for Project A in 2022, find the budget for Project B in 2023, and tell me which one grew faster."
    • This forces the AI to jump between documents, read tables, and understand layouts (like knowing that a number in a chart means something different from a number in a sentence).
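The multi-hop pattern above boils down to: retrieve a figure per document, then combine them with a final reasoning step. A minimal sketch of that last step, with made-up budget figures (the project names, years, and amounts are hypothetical illustrations, not data from the paper):

```python
# Hypothetical figures for the two "hops" -- in the benchmark, each pair
# of numbers would have to be located in a different document first.
budgets = {
    "Project A": {2021: 100_000, 2022: 130_000},  # grew 30%
    "Project B": {2022: 200_000, 2023: 250_000},  # grew 25%
}

def growth(figures: dict, start: int, end: int) -> float:
    """Relative growth of a budget between two years."""
    return (figures[end] - figures[start]) / figures[start]

growth_a = growth(budgets["Project A"], 2021, 2022)
growth_b = growth(budgets["Project B"], 2022, 2023)
faster = "Project A" if growth_a > growth_b else "Project B"
```

The retrieval hops are the hard part for the agent; the arithmetic at the end is trivial once the right numbers are in hand.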

3. The New Scorecard: Accuracy vs. Effort

The researchers didn't just ask, "Did you get the answer right?" They also asked, "How hard did you work for it?"

They introduced a new way to measure calibration: how well the effort an agent spends lines up with the accuracy it actually gains.

  • The Good Detective: Asks one or two smart questions, finds the answer, and stops. (High accuracy, low effort).
  • The Bad Detective: Asks 50 questions, gets lost in loops, spends hours searching, and maybe finds the answer. (Similar accuracy, massive effort).

They used a metric called the Kuiper Statistic (don't worry about the name!) to measure this. Think of it like a "wasted energy" meter. If the AI keeps searching even when it's failing, the meter goes up. If it knows when to stop, the meter stays low.
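For the curious, here is what the standard two-sample Kuiper statistic looks like in code. This is a generic sketch, not the paper's exact formulation: it measures how far apart two empirical distributions are (say, effort spent on questions the agent got right vs. questions it got wrong), by summing the largest deviation of one cumulative curve above the other and the largest deviation below.

```python
import numpy as np

def kuiper_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample Kuiper statistic V = D+ + D-: the sum of the largest
    deviation of a's empirical CDF above b's, and the largest below it.
    V = 0 means the samples look identical; larger V means the two
    distributions diverge more."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # evaluate both CDFs at every sample point
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    d_plus = float(np.max(cdf_a - cdf_b))
    d_minus = float(np.max(cdf_b - cdf_a))
    return d_plus + d_minus
```

Intuitively, if the agent burns the same amount of search effort whether it is succeeding or failing, the two effort distributions overlap and the statistic stays low in a bad way for diagnosis but high on the "wasted energy" meter described above; the exact pairing of distributions used in the paper is an assumption here.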

4. The Shocking Results

The team tested the world's best AI models against human experts. Here is what they found:

  • The "Oracle Gap": Even the smartest AI (like Gemini 3 Pro) only got about 82% of the questions right. Humans, using the same search tools, got 99% right. There is still a huge gap. The AI is missing the "last mile" of understanding.
  • Brute Force vs. Strategy: The best AI models matched human accuracy only because they were allowed to search 10 times. Humans found the answer on the first try 50% of the time. The AI started at 12% and had to keep searching to catch up.
  • Different Mistakes:
    • Humans made mistakes because they got tired or missed a "not" in a sentence (e.g., reading "do not allow" as "allow").
    • AI made mistakes because it couldn't find the right document in the first place. It was searching the wrong aisle of the library.
  • The "Cold Start" Problem: Humans are great at guessing the right search term immediately. AI models often start with a terrible search term and have to "recover" by searching frantically.

5. Why This Matters

This paper is a wake-up call. It tells us that current AI isn't truly "thinking" its way through complex documents yet. It's mostly stochastic search (random trial and error) disguised as intelligence.

The Analogy:
Imagine you are looking for a specific needle in a haystack.

  • The Human looks at the shape of the haystack, smells the hay, and uses a magnet to find the needle in 10 seconds.
  • The Current AI grabs a shovel, starts digging randomly, digs 100 holes, gets tired, and eventually finds the needle in the 101st hole. It found the needle, but it wasted a lot of energy and time.

The Takeaway

The researchers are releasing this library and the test tools to the public. Their goal is to push AI developers to stop building "brute-force" searchers and start building calibrated, efficient thinkers that know when to stop searching and how to plan their next move.

We are moving from the era of "Can the AI read the document?" to "Can the AI think like a detective?" And right now, the AI is still a very eager, but very inefficient, intern.