The Great Document Detective Contest: Strategy vs. Guessing

Imagine you are hired to solve a mystery. You are handed a massive, dusty library containing 800 different books, reports, and manuals. Some are thin pamphlets; others are 800-page legal contracts. Your job is to answer specific questions like, "Which restaurant has a lower instructor-to-student ratio: the Firearms range or the New Mexico Justice System?"

To do this, you can't just guess. You have to find the right pages, read the fine print, look at the charts, and piece together clues from different books to get the answer.

This is exactly what the new paper, MADQA, is all about. It's a giant test designed to see if modern AI "agents" (smart computer programs) are actually strategic detectives or just lucky guessers throwing darts at a wall.

Here is the breakdown of the paper in simple terms:

1. The Problem: Are They Smart or Just Loud?

For a long time, we've asked AI to read documents. But most tests were too easy. They were like asking, "What is the capital of France?" while holding a map of France. The AI just memorized the map.

The researchers wanted to know: When the AI has to hunt for information in a messy, complex library, does it have a plan?

The Ideal: A human detective who knows to check the "Financials" section first, then cross-reference it with the "HR" section.
The Reality: Many AIs act like a frantic squirrel. They search, fail, search again, search harder, and eventually stumble on the answer by sheer volume of effort, not by being smart.

2. The Solution: The "MADQA" Library

The team built a new, super-hard test called MADQA (Multimodal Agentic Document QA).

The Collection: 800 real-world PDFs (tax forms, menus, legal filings, technical manuals).
The Questions: 2,250 questions written by humans. These aren't simple "find the word" tasks. They require Multi-Hop Reasoning.
- Example: "Find the budget for Project A in 2022, find the budget for Project B in 2023, and tell me which one grew faster."
- This forces the AI to jump between documents, read tables, and understand layouts (like knowing that a number in a chart means something different than a number in a sentence).

3. The New Scorecard: Accuracy vs. Effort

The researchers didn't just ask, "Did you get the answer right?" They also asked, "How hard did you work for it?"

They introduced a new way to measure Calibration.

The Good Detective: Asks one or two smart questions, finds the answer, and stops. (High accuracy, low effort).
The Bad Detective: Asks 50 questions, gets lost in loops, spends hours searching, and maybe finds the answer. (High accuracy, massive effort).

They used a metric called the Kuiper Statistic (don't worry about the name!) to measure this. Think of it like a "wasted energy" meter. If the AI keeps searching even when it's failing, the meter goes up. If it knows when to stop, the meter stays low.

4. The Shocking Results

The team tested the world's best AI models against human experts. Here is what they found:

The "Oracle Gap": Even the smartest AI (like Gemini 3 Pro) only got about 82% of the questions right. Humans, using the same search tools, got 99% right. There is still a huge gap. The AI is missing the "last mile" of understanding.
Brute Force vs. Strategy: The best AI models matched human accuracy only because they were allowed to search 10 times. Humans found the answer on the first try 50% of the time. The AI started at 12% and had to keep searching to catch up.
Different Mistakes:
- Humans made mistakes because they got tired or missed a "not" in a sentence (e.g., reading "do not allow" as "allow").
- AI made mistakes because it couldn't find the right document in the first place. It was searching the wrong aisle of the library.
The "Cold Start" Problem: Humans are great at guessing the right search term immediately. AI models often start with a terrible search term and have to "recover" by searching frantically.

5. Why This Matters

This paper is a wake-up call. It tells us that current AI isn't truly "thinking" its way through complex documents yet. It's mostly stochastic search (random trial and error) disguised as intelligence.

The Analogy:
Imagine you are looking for a specific needle in a haystack.

The Human looks at the shape of the haystack, smells the hay, and uses a magnet to find the needle in 10 seconds.
The Current AI grabs a shovel, starts digging randomly, digs 100 holes, gets tired, and eventually finds the needle in the 101st hole. It found the needle, but it wasted a lot of energy and time.

The Takeaway

The researchers are releasing this library and the test tools to the public. Their goal is to push AI developers to stop building "brute-force" searchers and start building calibrated, efficient thinkers that know when to stop searching and how to plan their next move.

We are moving from the era of "Can the AI read the document?" to "Can the AI think like a detective?" And right now, the AI is still a very eager, but very inefficient, intern.

1. Problem Statement

The paper addresses a critical gap in evaluating Multimodal Large Language Model (MLLM) agents: Do these agents demonstrate genuine strategic reasoning, or are they merely performing stochastic trial-and-error searches?

Current benchmarks often fail to distinguish between these behaviors because they rely on:

Single-step retrieval: Tasks solvable by a single query.
Text-only or HTML inputs: Ignoring the visual complexity (layouts, tables, figures) of real-world enterprise documents.
Automated or recycled data: Introducing bias toward specific models or allowing data contamination.

The authors argue that existing benchmarks do not adequately test Agentic Document Collection Visual Question Answering (VQA), a task requiring iterative planning, navigation across disjoint pages/documents, and visual comprehension to synthesize answers from heterogeneous PDF collections.

2. Methodology: The MADQA Benchmark

To solve this, the authors introduce MADQA (Multimodal Agentic Document QA), a benchmark designed with rigorous psychometric principles.

A. Dataset Construction

Scale: 2,250 human-authored questions over 800 heterogeneous PDF documents (12.2M tokens).
Source: Documents are manually curated from DocumentCloud, covering 13 high-level domains (Financial, Legal, Government, HR, etc.) and 63 fine-grained categories.
Diversity: The corpus includes a mix of scanned and born-digital PDFs with varying layouts (forms, tables, charts, handwritten notes).
Annotation Protocol:
- Strict Grounding: Answers must be extractive (tokens appear in the document) and derived solely from the corpus (Closed-World assumption).
- Multi-Hop Design: ~17.3% of questions require cross-page or cross-document reasoning.
- Human Baseline: 1,200+ hours of professional annotation. Annotators used the same search interface as the agents to ensure fair comparison of retrieval efficiency.
- Quality Control: A two-step verification process involving GPT-5 oracle checks and human expert review to eliminate ambiguity and ensure solvability.

B. Task Formalization

The task is defined by six core properties distinguishing it from standard RAG:

Extractive: Answers must be grounded in evidence tokens.
Multi-Hop: Evidence may span disjoint pages or documents.
Closed-World: No external parametric knowledge allowed.
Grounded: Minimal evidence set required (no superfluous pages).
Agentic: Cannot be solved by a single retrieval query; requires planning and iteration.
Visual: Requires understanding non-textual elements (layout, tables, artifacts).

C. Evaluation Protocol

The authors propose a novel evaluation framework focusing on Accuracy-Effort Trade-offs:

Accuracy: Measured via an LLM-as-a-Judge calibrated to human judgments (focusing on semantic equivalence rather than strict string matching).
Attribution: Measured via Page F1 and Doc F1 to distinguish between finding the right document vs. the right page.
Efficiency & Calibration: A novel metric based on the Kuiper Statistic (derived from Cumulative Difference curves). It measures whether an agent's accuracy improves as it invests more effort (steps). A low Kuiper score indicates stable performance; a high score indicates "diminishing returns" or unproductive loops.

D. Dataset Splits

Using Classical Test Theory (CTT), the dataset is split into:

Train (1,550): For RL-based optimization.
Dev (200): For tuning.
Test (500): Selected to maximize discrimination between models.
Sentinel Pool (20% of test): The hardest items (solved by <10% of models) reserved to prevent benchmark saturation as models improve.

3. Key Contributions

MADQA Benchmark: The first fully human-annotated, multimodal benchmark for agentic reasoning over complex PDF collections, addressing limitations in format, scope, and data integrity of prior works.
Construct Validity Framework: Operationalized methods to prove the benchmark measures reasoning rather than lexical overlap or memorization (showing only ~11% of answers can be guessed without documents).
Efficiency Calibration Metric: Introduction of the Kuiper statistic to quantify the "effort-invariance" of agents, revealing that many agents waste compute on unproductive search loops.
Human-Agent Comparative Study: The first detailed comparison of human vs. agent search behaviors, revealing fundamental differences in competency and strategy.

4. Key Results & Analysis

A. Performance vs. Oracle Gap

Top Agents: The best agentic system (Gemini 3 Pro BM25 Agent) achieves 82.2% accuracy.
Oracle Gap: Even with perfect retrieval tools, humans achieve 99.4% accuracy. There remains a ~17-18% gap that current agents cannot close, primarily due to retrieval bottlenecks and inability to navigate complex layouts.
Static RAG vs. Agents: Agentic systems (iterative search) significantly outperform static RAG (single-shot retrieval), confirming the value of planning.

B. Efficiency and Calibration

Stochastic Search: Many agents exhibit poor calibration. They often persist in "unproductive loops," expending massive compute (e.g., Claude Sonnet 4.5 RLM processed 270M tokens for $850) without matching the accuracy of constrained agents.
Human Superiority: Humans are far more efficient. They achieve ~50% accuracy on their first query (Cold Start), whereas top agents start at ~12%. Humans also have a much lower Kuiper score (14.6 vs. 22.9–73.2 for agents), indicating better effort allocation.

C. Error Taxonomy

Failure Modes: Errors shift as models improve. Weak models fail at Retrieval (35.7% of errors) or Refusal (12.6%). Stronger models solve retrieval but fail at Comprehension (28.8%) or Navigation (23.0%).
Semantic vs. Physical Distance: For multi-hop questions, semantic distance (conceptual dissimilarity) is a stronger predictor of failure than physical page distance.
Complementarity: Humans and top agents succeed on different questions (Cohen's $\kappa$ = 0.24). Humans fail more on attention-heavy extractions; agents fail on retrieval and complex reasoning. This suggests hybrid human-agent pipelines could exceed current ceilings.

D. Visual Necessity

58% of questions require understanding structured layouts, tables, or visual artifacts. Pure text extraction is insufficient; visual encoders or layout-aware parsing are critical.

5. Significance and Future Directions

Paradigm Shift: The paper argues that the field must move from "brute-force retrieval" to calibrated, efficient reasoning. Current frontier models often rely on stochastic search rather than strategic planning.
Architectural Insights:
- Memory: Agents lack episodic memory to learn corpus-specific terminology, leading to repeated failures.
- RL with Tool Feedback: Reinforcement learning using search tool feedback could significantly improve exploration policies.
- Constrained Agency: Constrained iterative agents (e.g., BM25 + MLLM) outperform unconstrained Recursive Language Models (RLMs) in both cost and accuracy.
Community Impact: The release of MADQA, the evaluation harness, and the "Sentinel Pool" ensures the benchmark remains a discriminative signal for future model development, preventing premature saturation.

In conclusion, MADQA reveals that while agents can match human accuracy on specific tasks, they lack human efficiency and strategic calibration. The path forward lies in improving retrieval planning, reducing stochastic search behaviors, and integrating episodic memory to handle complex, multi-document workflows.

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections