Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
This paper introduces the MADQA benchmark and a novel accuracy-effort evaluation protocol to demonstrate that while multimodal agents can match human accuracy on document-based tasks, they rely on inefficient brute-force search rather than genuine strategic reasoning, failing to close the performance gap to oracle levels.
Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta2026-03-13💬 cs.CL