MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

The paper introduces MultiHaystack, a new benchmark comprising over 46,000 multimodal documents, images, and videos, built to evaluate the critical gap between retrieval and reasoning in multimodal large language models. It reveals that current systems struggle significantly when they must locate evidence within a large-scale, heterogeneous corpus rather than being handed it directly.

Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu, Lei Sun, Luc Van Gool, Danda Pani Paudel, Chun-Mei Feng

Published 2026-03-09

Imagine you are a detective trying to solve a mystery. In the past, AI researchers gave detectives a single, small box containing only the most relevant clues (a few photos, one document, or a short video clip) and asked, "Can you figure out the answer?" The AI was usually great at this because the clues were right in front of it.

But in the real world, a detective doesn't get a pre-sorted box. They get a giant warehouse filled with 46,000 items: thousands of videos, stacks of documents, and countless photos. They have to find the one specific piece of evidence hidden in that massive pile before they can even start solving the mystery.

This paper introduces MultiHaystack, a new "test" designed to see if AI can actually do this real-world job.

Here is the breakdown of what the paper found, using simple analogies:

1. The Problem: The "Small Box" Trap

Most current AI benchmarks are like giving a detective a single, labeled envelope and asking, "Who is the killer?" The AI gets a perfect score because it didn't have to search; the answer was handed to it.

  • The Reality: Real life is a "Needle in a Haystack" problem, but the haystack is huge, and the needle is made of different materials (some are videos, some are text, some are images).
  • The Flaw: Previous tests only used small haystacks (maybe 100 items) or only one type of item (only text). This made the search too easy and made the AI look smarter than it really is.

2. The Solution: MultiHaystack

The authors built a massive, realistic test environment:

  • The Haystack: Over 46,000 items mixed together (videos, images, and documents).
  • The Needle: 747 questions, each with its answer hidden in exactly one item in that pile.
  • The Twist: The AI has to do two things:
    1. Find the needle: Search the 46,000 items to find the right one.
    2. Solve the puzzle: Read that specific item and answer the question.
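The two steps above can be sketched in code. This is a minimal toy illustration of the "find, then solve" setup, not the paper's actual pipeline: the embeddings, the `retrieve` ranking, and the placeholder `answer` step are all invented stand-ins.

```python
# Toy sketch of a two-stage retrieve-then-answer system.
# Stage 1 ranks every item in the haystack against the question;
# Stage 2 reasons over only the top-ranked item.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, corpus_vecs, k=1):
    """Stage 1: return indices of the k most similar corpus items."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: dot(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]

def answer(question, item_content):
    """Stage 2 (placeholder): answer using the single retrieved item."""
    return item_content.get(question, "unknown")

# Tiny two-item "haystack" with made-up embeddings.
corpus = [
    {"vec": [1.0, 0.0], "content": {"who?": "Alice"}},
    {"vec": [0.0, 1.0], "content": {"who?": "Bob"}},
]
top = retrieve([0.9, 0.1], [c["vec"] for c in corpus], k=1)[0]
print(answer("who?", corpus[top]["content"]))  # → Alice
```

The key point the benchmark tests is that an error in stage 1 is unrecoverable: if `retrieve` picks the wrong item, stage 2 never sees the evidence it needs.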

3. The Shocking Results

The researchers tested state-of-the-art AI models (including GPT-5) on this new test, and the results were a wake-up call:

  • Scenario A (The Cheat Code): When the researchers gave the AI the exact right document or video clip to look at, the AI was brilliant. It solved about 80% of the questions correctly.
  • Scenario B (The Real World): When the AI had to search the 46,000-item pile first, its performance crashed.
    • The best AI models dropped from 80% accuracy down to roughly 50%.
    • Even the best "search engines" (retrievers) could only find the right needle in the haystack about 40% of the time on the first try.
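That "right needle on the first try" figure corresponds to a standard retrieval metric, recall@1: the fraction of questions where the correct item is ranked first. A toy illustration with invented rankings (not the paper's data):

```python
# recall@k: fraction of questions whose one correct ("gold") item
# appears among the retriever's top-k results.

def recall_at_k(rankings, gold, k=1):
    """rankings[i] lists item ids in ranked order for question i;
    gold[i] is the id of the single correct item."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

# Five made-up questions; the gold item is top-ranked for only two.
rankings = [[3, 7, 1], [2, 5, 9], [8, 4, 0], [6, 1, 2], [0, 1, 2]]
gold     = [3, 5, 8, 1, 2]
print(recall_at_k(rankings, gold, k=1))  # → 0.4 (2 of 5)
```

At recall@1 around 0.4, the reasoning model never even sees the right evidence for the majority of questions, which caps end-to-end accuracy no matter how smart the model is.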

The Metaphor: Imagine a student who is a genius at math. If you give them the right textbook page, they solve the problem instantly. But if you tell them, "Go find the answer in this library of 46,000 books," they wander around, pick the wrong book, and fail the test. The problem isn't that they can't do the math; it's that they can't find the book.

4. Why Is This So Hard?

The paper found that the difficulty comes from the "mix" of items:

  • Cross-Modal Confusion: The AI gets confused when it has to search for a video using a text question, or find a document using an image. It's like trying to find a specific song by humming a tune, but the library only has sheet music and vinyl records. The AI struggles to match the "vibe" across different formats.
  • The "Distractor" Effect: The test includes items that look similar but are wrong. For example, if you ask about a specific news report from 1994, the AI might grab a news report from 1995 that looks almost identical. It gets tricked by surface-level similarities.
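The distractor effect falls out naturally from embedding-based search: a near-duplicate wrong item can score as high as, or higher than, the correct one. The sketch below uses invented toy vectors (not outputs of any real embedding model) to show a 1995 "distractor" beating the 1994 "gold" item under cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return num / (na * nb)

# Invented 3-d embeddings: the distractor points slightly closer
# to the query's direction than the correct item does.
query      = [0.9, 0.1, 0.40]   # question about the 1994 report
gold_item  = [0.9, 0.1, 0.50]   # the actual 1994 report
distractor = [0.9, 0.1, 0.45]   # a near-identical 1995 report

print(cosine(query, distractor) > cosine(query, gold_item))  # → True
```

A nearest-neighbor retriever would rank the distractor first here, even though it is the wrong item, because surface-level similarity dominates the score.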

5. What Does This Mean for the Future?

The paper concludes that retrieval (finding the info) is currently the biggest bottleneck for AI, not reasoning (thinking about the info).

  • The Bottleneck: We are building incredibly smart "brains" (reasoning models), but we haven't built good enough "librarians" (search systems) to feed them the right information.
  • The Path Forward: To make AI truly useful in the real world, we need to stop testing them on small, easy datasets. We need to build systems that are better at navigating massive, messy, mixed-media libraries before they try to answer complex questions.

In short: The paper says, "Stop praising the AI for being smart when the answer is handed to it. Let's see if it can actually find the answer in the real world first." And right now, it's still struggling to find its way out of the haystack.