BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence

Imagine you are a detective trying to solve a complex mystery. You have a massive, 100-page case file (a long scientific paper) that contains clues scattered everywhere: paragraphs of text, dense spreadsheets of data, and colorful charts.

Your job isn't just to find the answer to a simple question like "What color is the suspect's car?" (which you could find on page 1). Instead, you need to solve a multi-hop mystery.

For example: "Based on the chart on page 45, which shows the crime rate, and the text on page 78, which explains the new law, did the new law actually reduce crime in the specific neighborhood mentioned in the table on page 12?"

To solve this, you have to:

Read the text to understand the law.
Look at the table to find the specific neighborhood.
Check the chart to see the crime numbers.
Connect all three pieces of information to form a conclusion.

The Problem: The "Smart" Detective Who Skips Steps

Recently, we've built very smart AI detectives (Large Language Models). They are great at reading and answering questions. But most tests we give them are like asking, "What's the suspect's name?" and checking if the AI got the name right.

The problem is, these AI detectives often cheat. They might guess the answer based on a keyword they saw, or they might skip the hard work of connecting the dots. They get the final answer right by luck, but they didn't actually do the reasoning.

Furthermore, when the clues are in a spreadsheet or a chart (multimodal), the AI often gets confused. It might ignore the chart entirely and just guess based on the text, or it might get lost in a 100-page document and miss the clue on page 90.

The Solution: BRIDGE

The authors of this paper created a new test called BRIDGE. Think of BRIDGE as a "Maze of Truth" designed specifically to catch AI detectives who are cheating or skipping steps.

Here is what makes BRIDGE special:

It's a Long, Messy Case File: Unlike previous tests that used short, simple stories, BRIDGE uses real, long scientific papers. The clues are hidden deep inside, requiring the AI to read the whole document.
It Mixes Clue Types: The clues aren't just words. They are a mix of text, tables (spreadsheets), and figures (charts). The AI has to be fluent in all three languages to solve the puzzle.
It Grades the "Thinking Process," Not Just the Answer: This is the most important part. In school, if you get the right answer but show no work, you might still get an A. In BRIDGE, the teachers (the evaluators) check your step-by-step reasoning.
- Did you actually look at the chart?
- Did you connect the table to the text correctly?
- Or did you just hallucinate (make up) a connection?
BRIDGE gives you a grade for how you got there, not just what you got.

What They Found (The Plot Twist)

The researchers tested their "smartest" AI detectives on this new BRIDGE maze. Here is what happened:

The "Direct" Approach: When the AI was allowed to read the whole document at once, it did okay, but it still made mistakes connecting the dots.
The "Retrieval" Approach (The RAG Trap): Usually, when documents are too long, we use a system to "search" for the relevant pages first (like a librarian finding the right book for you). The researchers tried this with a tool called ColPali.
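- The Result: The AI's performance dropped sharply. With Gemini, Audit scores fell by about 1.7 and Accuracy by about 1.8 compared with reading the whole document directly.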
- Why? The librarian (the search tool) kept handing the AI the wrong pages or missing pages entirely. Because the AI couldn't find the specific clue on page 90, it couldn't solve the mystery. It showed that even our best search tools struggle to find specific evidence in long, complex documents.
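Why a single missing page is so damaging can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names and the page numbers are hypothetical, and it assumes each question comes with a known list of pages that hold its evidence.

```python
def evidence_coverage(retrieved_pages, gold_pages):
    """Fraction of the gold evidence pages that appear in the retrieved set."""
    gold = set(gold_pages)
    if not gold:
        return 1.0
    return len(gold & set(retrieved_pages)) / len(gold)


def all_hops_retrieved(retrieved_pages, gold_pages):
    """A multi-hop question is only answerable if every hop's page was retrieved."""
    return set(gold_pages) <= set(retrieved_pages)


# Hypothetical example: the evidence sits on pages 12, 45 and 78,
# but the retriever's top-5 pages miss page 78.
retrieved = [12, 45, 3, 46, 13]
gold = [12, 45, 78]
print(f"coverage={evidence_coverage(retrieved, gold):.2f}")
print(f"answerable={all_hops_retrieved(retrieved, gold)}")
```

Coverage looks respectable here (two of three pages), yet the question cannot be answered, because the chain of clues is broken at the missing hop.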

The Takeaway

BRIDGE is a wake-up call. It tells us that just because an AI can answer a question correctly doesn't mean it understands the document.

It's like a student who memorizes the answer key but doesn't know how to do the math. BRIDGE forces the AI to show its homework, proving that it can actually navigate the messy, multi-page, chart-filled world of real scientific research.

In short: We built a harder, more realistic test to stop AI from cheating and to help us figure out exactly where their "brain" breaks when trying to connect complex clues.

Here is a detailed technical summary of the paper "BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence."

1. Problem Statement

Current Large Language Models (LLMs) have improved document-based Question Answering (QA), but significant gaps remain in high-stakes domains (finance, healthcare, academic research) where answers are rarely explicit. Instead, they require multi-hop reasoning across long, heterogeneous documents containing text, tables, and figures.

Existing benchmarks suffer from three main limitations:

Lack of Intermediate Supervision: Most focus solely on final answer correctness, ignoring the validity of intermediate reasoning steps.
Shallow Multimodal Integration: Existing multimodal datasets often treat modalities (text, tables, figures) as independent or redundant sources, allowing models to rely on text cues while ignoring complex tabular or visual data.
Short Context Bias: Most benchmarks rely on short passages (e.g., Wikipedia) rather than long-form scientific papers where claims in text are quantified in tables and validated in figures, creating intrinsic cross-modal dependency chains.

2. Methodology: The BRIDGE Benchmark

Dataset Construction

Source: 262 top-tier research papers (2023–2025) from venues like ACL, EMNLP, CVPR, and ICCV.
Scale: 11,857 multi-hop QA pairs.
Preprocessing: Used Adobe PDF Extract API to parse PDFs into semantic entities with layout-aware metadata (page indices, bounding boxes) for text, tables, and figures.
Generation Strategy:
- Question Types: Three categories defined by reasoning patterns:
  - Causal Reasoning (Re): Hops connected by causal relations.
  - Comparative (Cp): Comparisons across entities/numerical values.
  - Abstractive (Ab): Holistic summary-style answers requiring full-paper understanding.
- Prompting: A two-stage framework using Chain-of-Thought (CoT) prompting: (1) Structure Mining to extract entity relations, and (2) Constraint-Guided Generation to create QA pairs with explicit evidence hops.
- Quality Control: A dual-stage filtering pipeline involving rule-based checks and an "LLM-as-a-judge" to eliminate hallucinations and "single-hop shortcuts."
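The parsed entities and the rule-based half of the quality filter could be represented roughly as follows. This is a sketch under stated assumptions: the summary does not list the paper's actual schema or rules, so every class name, field name, and the `min_hops` threshold here is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One parsed document element with layout-aware metadata (assumed fields)."""
    entity_id: str
    modality: str   # "text", "table", or "figure"
    page: int       # page index within the PDF
    bbox: tuple     # (x0, y0, x1, y1) bounding box


@dataclass
class QAPair:
    """A generated question with its answer and explicit evidence hops."""
    question: str
    answer: str
    hops: list      # Entity objects cited as evidence, in order


def passes_rule_checks(qa, min_hops=2):
    """Assumed rule: keep only pairs with an answer and at least
    `min_hops` distinct evidence entities (no single-hop shortcuts)."""
    distinct = {e.entity_id for e in qa.hops}
    return bool(qa.answer.strip()) and len(distinct) >= min_hops


# Hypothetical entities and QA pairs for illustration only.
table = Entity("t1", "table", 5, (50, 100, 500, 300))
figure = Entity("f2", "figure", 9, (60, 80, 520, 400))
multi_hop = QAPair("Which model wins on both metrics?", "Model A", [table, figure])
shortcut = QAPair("What is reported in Table 1?", "42", [table])
print(passes_rule_checks(multi_hop), passes_rule_checks(shortcut))
```

Pairs that survive a check like this would then go to the LLM judge, which handles what simple rules cannot, such as hallucinated relations between hops.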

Task Definition
The task requires a model $F$ to take a question $q$ and a multi-page document $D$ (containing text, tables, and figures) and to output a final answer $a$ together with a set of supporting evidence items $E$.
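The same definition can be written compactly as a mapping. The notation beyond $F$, $q$, $D$, $a$, and $E$ is illustrative rather than taken from the paper: it assumes the document is an ordered sequence of $n$ pages $p_i$ and that the evidence set contains $k$ items $e_j$, with $k \ge 2$ for a multi-hop question.

```latex
F : (q, D) \mapsto (a, E), \qquad
D = (p_1, \dots, p_n), \qquad
E = \{e_1, \dots, e_k\}, \quad k \ge 2
```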

Reasoning Structures: Supports both Chain-like (sequential, dependent hops) and Fan-out (parallel evidence collection) structures.
Evaluation: Goes beyond answer accuracy to include step-level evaluation of intermediate reasoning states and evidence grounding.
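The difference between the two reasoning structures can be made concrete by treating hops as a small dependency map. This is a minimal sketch, assuming each hop lists the hops it needs first; the function name, the hop ids, and the extra "mixed" label are hypothetical and not part of the benchmark.

```python
def classify_structure(depends_on):
    """Classify a hop dependency map as 'chain', 'fan-out', or 'mixed'.

    depends_on maps each hop id to the list of hop ids it needs first.
    Assumes the dependencies contain no cycles.
    """
    # Fan-out: every hop is collected independently of the others.
    if all(not deps for deps in depends_on.values()):
        return "fan-out"
    roots = [h for h, deps in depends_on.items() if not deps]
    parents = [deps for deps in depends_on.values() if deps]
    one_parent_each = all(len(deps) == 1 for deps in parents)
    parent_ids = [deps[0] for deps in parents]
    no_shared_parent = len(set(parent_ids)) == len(parent_ids)
    # Chain: one starting hop, and each later hop builds on exactly one
    # earlier hop that no other hop also builds on.
    if len(roots) == 1 and one_parent_each and no_shared_parent:
        return "chain"
    return "mixed"


chain = {"h1": [], "h2": ["h1"], "h3": ["h2"]}
fan_out = {"h1": [], "h2": [], "h3": []}
print(classify_structure(chain), classify_structure(fan_out))
```

In a chain, an error at an early hop propagates to every later one, which is one reason step-level evaluation is more informative than checking only the final answer.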

3. Key Contributions

BRIDGE Benchmark: The first benchmark specifically designed for multi-hop reasoning over long multimodal scientific documents, supporting both chain-like and fan-out structures.
Explicit Reasoning Annotations: Provides step-level annotations for evidence usage, enabling evaluation of reasoning depth rather than just final output correctness.
Structured Error Taxonomy: Introduces a framework to diagnose specific failure modes (e.g., grounding errors, comparison reversals, evidence missing).
Comprehensive Evaluation Protocol: Establishes a unified pipeline using LLM-as-a-judge (Audit, Accuracy, Fidelity) alongside lexical metrics (ROUGE, BLEU) to assess both factual correctness and reasoning alignment.

4. Experimental Results

The authors evaluated state-of-the-art models (ChatGPT, Gemma, Gemini, Qwen) and Retrieval-Augmented Generation (RAG) systems using ColPali as a multimodal retriever.

Key Findings:

Model Performance:
- ChatGPT performed best overall (Audit scores ~4.4), followed by Gemma, Gemini, and Qwen.
- Strategy Sensitivity: Performance varied significantly by prompting strategy. For instance, Gemini degraded with CoT/Reflection, while Qwen improved.
- Lexical vs. Factual: High ROUGE/BLEU scores did not always correlate with high factual grounding (Fidelity). CoT prompting often increased factual correctness but decreased lexical overlap with ground truth due to paraphrasing.
The "RAG Gap":
- Integrating ColPali (a strong visual retriever) with Gemini significantly degraded performance compared to direct prompting.
- Result: Audit scores dropped by ~1.7, and Accuracy by ~1.8.
- Cause: Retrieval mismatch and the difficulty of locating multi-hop evidence across long documents led to "evidence missing" rather than just paraphrasing errors.
Difficulty Breakdown:
- Question Type: Comparative questions were the hardest (even top models dropped significantly), while Causal questions were most stable.
- Evidence Modality: Tables were the most challenging modality. Models performed significantly worse on table-based evidence than on text or figures (e.g., Gemini's Audit score dropped from 4.25 on text to 3.37 on tables).
- Document Depth: Performance degraded as the required evidence appeared on deeper pages (e.g., pages 21+), indicating limitations in long-context search and retention.
- Hop Depth: Strong models maintained performance between 2-hop and 3+-hop questions, suggesting hop depth alone isn't the primary difficulty factor; rather, the complexity of cross-modal alignment is.
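The "Lexical vs. Factual" finding is easier to see with a toy overlap metric. The function below is a simplified unigram-overlap F1 in the spirit of ROUGE-1, not the official ROUGE implementation and not the paper's scoring code; the example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Simplified unigram-overlap F1 (whitespace tokens, lowercased)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared word at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "model a outperforms model b on tables"
verbatim = "model a outperforms model b on tables"
paraphrase = "on tabular data model b is weaker than model a"
print(unigram_f1(verbatim, reference))    # identical wording
print(unigram_f1(paraphrase, reference))  # same fact, different wording
```

Both candidate answers state the same fact, but the paraphrase scores lower on word overlap. A judge-based score such as Fidelity is meant to capture what the lexical metric misses.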

5. Significance and Conclusion

Significance:

Diagnosing Reasoning Failures: BRIDGE reveals that current SOTA models and RAG systems suffer from systematic deficiencies in evidence aggregation and grounding, which are hidden by conventional "answer-only" evaluation.
Multimodal Challenge: It highlights that tables are a specific bottleneck for multimodal reasoning, requiring more than just text-based pattern matching.
RAG Limitations: The study demonstrates that current retrieval mechanisms (even visual ones like ColPali) are insufficient for complex, multi-hop reasoning in long scientific documents, often introducing more noise than signal.

Future Directions:
The benchmark motivates future work in retrieval calibration, evidence verification, and citation-faithful generation. It serves as a targeted testbed for developing models that can truly synthesize information across text, tables, and figures in long-form documents.
