MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

The paper introduces MultiHaystack, a new benchmark comprising over 46,000 multimodal documents, images, and videos, built to evaluate the critical gap between retrieval and reasoning in multimodal large language models. It reveals that current systems struggle significantly when they must locate evidence within a large-scale, heterogeneous corpus rather than being handed it directly.

Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu, Lei Sun, Luc Van Gool, Danda Pani Paudel, Chun-Mei Feng

Published 2026-03-09

Imagine you are a detective trying to solve a mystery. In the past, AI researchers gave detectives a single, small box containing only the most relevant clues (a few photos, one document, or a short video clip) and asked, "Can you figure out the answer?" The AI was usually great at this because the clues were right in front of it.

But in the real world, a detective doesn't get a pre-sorted box. They get a giant warehouse filled with 46,000 items: thousands of videos, stacks of documents, and countless photos. They have to find the one specific piece of evidence hidden in that massive pile before they can even start solving the mystery.

This paper introduces MultiHaystack, a new "test" designed to see if AI can actually do this real-world job.

Here is the breakdown of what the paper found, using simple analogies:

1. The Problem: The "Small Box" Trap

Most current AI benchmarks are like giving a detective a single, labeled envelope and asking, "Who is the killer?" The AI gets a perfect score because it didn't have to search; the answer was handed to it.

  • The Reality: Real life is a "Needle in a Haystack" problem, but the haystack is huge, and the needle is made of different materials (some are videos, some are text, some are images).
  • The Flaw: Previous tests only used small haystacks (maybe 100 items) or only one type of item (only text). This made the search too easy and made the AI look smarter than it really is.

2. The Solution: MultiHaystack

The authors built a massive, realistic test environment:

  • The Haystack: Over 46,000 items mixed together (videos, images, and documents).
  • The Needle: 747 questions, each with its answer hidden in exactly one item in that pile.
  • The Twist: The AI has to do two things:
    1. Find the needle: Search the 46,000 items to find the right one.
    2. Solve the puzzle: Read that specific item and answer the question.
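The two steps above can be sketched in code. This is a minimal toy illustration of the "find, then solve" setup, not the paper's actual pipeline: the embeddings, the `retrieve` ranking, and the placeholder `answer` step are all invented stand-ins.

```python
# Toy sketch of a two-stage retrieve-then-answer system.
# Stage 1 ranks every item in the haystack against the question;
# Stage 2 reasons over only the top-ranked item.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, corpus_vecs, k=1):
    """Stage 1: return indices of the k most similar corpus items."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: dot(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]

def answer(question, item_content):
    """Stage 2 (placeholder): answer using the single retrieved item."""
    return item_content.get(question, "unknown")

# Tiny two-item "haystack" with made-up embeddings.
corpus = [
    {"vec": [1.0, 0.0], "content": {"who?": "Alice"}},
    {"vec": [0.0, 1.0], "content": {"who?": "Bob"}},
]
top = retrieve([0.9, 0.1], [c["vec"] for c in corpus], k=1)[0]
print(answer("who?", corpus[top]["content"]))  # → Alice
```

The key point the benchmark tests is that an error in stage 1 is unrecoverable: if `retrieve` picks the wrong item, stage 2 never sees the evidence it needs.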

3. The Shocking Results

The researchers tested state-of-the-art AI models (including GPT-5) on this new test, and the results were a wake-up call:

  • Scenario A (The Cheat Code): When the researchers gave the AI the exact right document or video clip to look at, the AI was brilliant. It solved about 80% of the questions correctly.
  • Scenario B (The Real World): When the AI had to search the 46,000-item pile first, its performance crashed.
    • The best AI models dropped from 80% accuracy down to roughly 50%.
    • Even the best "search engines" (retrievers) could only find the right needle in the haystack about 40% of the time on the first try.
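That "right needle on the first try" figure corresponds to a standard retrieval metric, recall@1: the fraction of questions where the correct item is ranked first. A toy illustration with invented rankings (not the paper's data):

```python
# recall@k: fraction of questions whose one correct ("gold") item
# appears among the retriever's top-k results.

def recall_at_k(rankings, gold, k=1):
    """rankings[i] lists item ids in ranked order for question i;
    gold[i] is the id of the single correct item."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

# Five made-up questions; the gold item is top-ranked for only two.
rankings = [[3, 7, 1], [2, 5, 9], [8, 4, 0], [6, 1, 2], [0, 1, 2]]
gold     = [3, 5, 8, 1, 2]
print(recall_at_k(rankings, gold, k=1))  # → 0.4 (2 of 5)
```

At recall@1 around 0.4, the reasoning model never even sees the right evidence for the majority of questions, which caps end-to-end accuracy no matter how smart the model is.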

The Metaphor: Imagine a student who is a genius at math. If you give them the right textbook page, they solve the problem instantly. But if you tell them, "Go find the answer in this library of 46,000 books," they wander around, pick the wrong book, and fail the test. The problem isn't that they can't do the math; it's that they can't find the book.

4. Why Is This So Hard?

The paper found that the difficulty comes from the "mix" of items:

  • Cross-Modal Confusion: The AI gets confused when it has to search for a video using a text question, or find a document using an image. It's like trying to find a specific song by humming a tune, but the library only has sheet music and vinyl records. The AI struggles to match the "vibe" across different formats.
  • The "Distractor" Effect: The test includes items that look similar but are wrong. For example, if you ask about a specific news report from 1994, the AI might grab a news report from 1995 that looks almost identical. It gets tricked by surface-level similarities.
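The distractor effect falls out naturally from embedding-based search: a near-duplicate wrong item can score as high as, or higher than, the correct one. The sketch below uses invented toy vectors (not outputs of any real embedding model) to show a 1995 "distractor" beating the 1994 "gold" item under cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return num / (na * nb)

# Invented 3-d embeddings: the distractor points slightly closer
# to the query's direction than the correct item does.
query      = [0.9, 0.1, 0.40]   # question about the 1994 report
gold_item  = [0.9, 0.1, 0.50]   # the actual 1994 report
distractor = [0.9, 0.1, 0.45]   # a near-identical 1995 report

print(cosine(query, distractor) > cosine(query, gold_item))  # → True
```

A nearest-neighbor retriever would rank the distractor first here, even though it is the wrong item, because surface-level similarity dominates the score.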

5. What Does This Mean for the Future?

The paper concludes that retrieval (finding the info) is currently the biggest bottleneck for AI, not reasoning (thinking about the info).

  • The Bottleneck: We are building incredibly smart "brains" (reasoning models), but we haven't built good enough "librarians" (search systems) to feed them the right information.
  • The Path Forward: To make AI truly useful in the real world, we need to stop testing them on small, easy datasets. We need to build systems that are better at navigating massive, messy, mixed-media libraries before they try to answer complex questions.

In short: The paper says, "Stop praising the AI for being smart when the answer is handed to it. Let's see if it can actually find the answer in the real world first." And right now, it's still struggling to find its way out of the haystack.