Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

This paper proposes the Deferred Visual Ingestion (DVI) framework. Instead of lossily pre-embedding visual content, DVI combines structure-based hierarchical indexing with deferred VLM analysis, overcoming the retrieval and detail-loss limitations of existing pre-ingestion methods and achieving significantly higher accuracy on visual-dense engineering document QA.

Tao Xu

Published 2026-02-27

Imagine you have a massive library of engineering blueprints—thousands of pages filled with complex drawings, numbers, and tiny details. You want to ask a specific question, like "What are the dimensions of the support beam on Bridge A?"

There are two ways to build a system to answer this question. The old way is like hiring a super-photographer to look at every single page in the library before you even ask your question. The new way (proposed in this paper) is like hiring a smart librarian who just knows where the books are, and only calls the photographer after you ask.

Here is the breakdown of the paper's idea, "Deferred Visual Ingestion" (DVI), using simple analogies.

1. The Problem: The "Blind Description" Trap

The Old Way (Pre-Ingestion):
Imagine you have 500 blueprints. The old method says: "Let's hire a robot (an AI) to look at every single page right now, write a summary of what it sees, and file that summary away."

  • The Flaw: The robot doesn't know what you are going to ask. It writes a generic summary like, "This page has a drawing of a bridge."
  • The Disaster: When you later ask, "What is the specific bolt size on Pier 3?", the robot's summary has already lost that detail because it wasn't looking for it. Worse, if you have 20 bridges that look almost identical, the summaries come out so similar that the computer can't tell the pages apart. It's like trying to pick one specific needle out of a pile of identical needles.

2. The Solution: "Index for Locating, Not Understanding"

The New Way (Deferred Visual Ingestion):
The authors propose a smarter strategy: Don't try to understand the picture until you have to.

Instead of hiring the robot to read every page now, you just look at the Table of Contents and the Drawing Numbers (like "Bridge-A-Part-1"). You build a simple map (an index) that says: "Page 45 is about Pier 3." This costs almost nothing and takes seconds.

When you ask your question:

  1. The system checks the map and finds the likely pages (e.g., "It's probably Page 45").
  2. Only then does it send the actual, original image of Page 45 to the robot, along with your specific question.
  3. The robot looks at the image with the question in mind and gives you the answer.
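The three steps above can be sketched in a few lines of Python. Everything here is illustrative: `load_page_image`, `vlm_answer`, and the index layout are hypothetical stand-ins, not the paper's actual components.

```python
def load_page_image(page: int) -> str:
    # Stand-in for fetching a page's original, full-resolution image.
    return f"image-of-page-{page}"

def vlm_answer(question: str, images: list[str]) -> str:
    # Stand-in for a vision-language model call on the raw images.
    return f"answer to {question!r} using {len(images)} page(s)"

def answer_query(question: str, index: dict[str, list[int]], keywords: list[str]) -> str:
    # Step 1: the lightweight index maps drawing numbers / ToC entries
    # to page numbers -- no image has been read at this point.
    candidates = sorted({p for kw in keywords for p in index.get(kw, [])})
    # Step 2: only now load the original images of the candidate pages.
    images = [load_page_image(p) for p in candidates]
    # Step 3: the VLM sees the images *together with* the question,
    # so it can hunt for the specific detail being asked about.
    return vlm_answer(question, images)

index = {"Pier-3": [45], "Pier-7": [88, 89]}
print(answer_query("What is the bolt size on Pier 3?", index, ["Pier-3"]))
```

Note that the expensive model is only invoked on the handful of pages the index points to, never on the whole document set.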

3. The Secret Sauce: The "HDNC" Algorithm

How does the system know which page is which without reading them?
Engineering drawings usually have a strict numbering system (like Project-Bridge-Section-123). The paper introduces a clever trick called HDNC.

  • The Analogy: Imagine a library where every book spine has a code like A-B-10. The system realizes that all books starting with A-B are about "Bridges," and those ending in 10 are about "Foundations."
  • The Magic: The computer automatically sorts these codes into a hierarchy (Bridge > Foundation > Specific Part) just by looking at the numbers. It builds a perfect map of the library without ever needing to "read" the content of the books. This is zero-cost because it's just math, not expensive AI reading.
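The hierarchy-building trick can be sketched as a prefix tree over delimited codes. This shows only the general idea; the paper's actual HDNC algorithm is not reproduced here, and the drawing-number format below is invented for illustration.

```python
def build_hierarchy(codes: list[str], sep: str = "-") -> dict:
    # Nest each delimited drawing number into a prefix tree:
    # "BridgeA-Pier3-Bolt12" becomes root["BridgeA"]["Pier3"]["Bolt12"].
    root: dict = {}
    for code in codes:
        node = root
        for part in code.split(sep):
            node = node.setdefault(part, {})
    return root

codes = ["BridgeA-Pier3-Bolt12", "BridgeA-Pier3-Plate04", "BridgeA-Pier7-Bolt12"]
tree = build_hierarchy(codes)
print(sorted(tree["BridgeA"]))  # ['Pier3', 'Pier7']
```

Splitting strings and nesting dictionaries is essentially free compared to running an AI model over every page, which is the "zero-cost" point the paper makes.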

4. Why This Works Better

The paper tested this on three types of documents: Bridge drawings, Steel catalogs, and Circuit diagrams.

  • The Result: The old method (hiring the robot early) got the right answer only 24% of the time. The new method (hiring the robot only when needed) got it right 65% of the time.
  • Why?
    1. No Information Loss: The old method threw away tiny details in its generic summaries. The new method keeps the original, high-definition image until the very last second.
    2. Better Search: The old method tried to find pages by "vibe" (semantic similarity), which fails when documents look alike. The new method uses exact matching (like searching for a specific ID number), which is much more reliable for engineering docs.
    3. Cheaper: You don't pay the expensive AI to read 500 pages you might never look at. You only pay for the 2 or 3 pages you actually need.
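The second point can be made concrete with a toy contrast between the two search styles. The word-overlap score below is a crude stand-in for embedding similarity, chosen only to illustrate the failure mode:

```python
def jaccard(a: str, b: str) -> float:
    # Crude word-overlap similarity, standing in for embedding similarity.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Two generic summaries of near-identical drawings are almost the same...
s1 = "drawing of a bridge pier with bolts"
s2 = "drawing of a bridge pier with plates"
print(jaccard(s1, s2))  # 0.75 -- too close to tell the pages apart

# ...whereas an exact match on the drawing number is unambiguous.
page_index = {"Pier-3": 45, "Pier-7": 88}
print(page_index["Pier-3"])  # 45
```

When hundreds of pages produce summaries this similar, a ranked-by-similarity search is a coin flip; an ID lookup is not.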

5. The "Lazy" Genius

The core philosophy of this paper is "Lazy Evaluation."
In computer science, "lazy" means "don't do work until it's absolutely necessary."

  • Old Way: Do all the work upfront (read every page), even if you might not need it.
  • New Way: Wait until you have a specific question, then do the work only on the relevant pages.
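In Python, lazy evaluation falls out of generators: no work happens until a value is actually requested. The "analysis" below is just a placeholder for an expensive per-page model call.

```python
def analyze_pages(pages):
    for p in pages:
        # Pretend this line is an expensive VLM call on one page image.
        yield f"analysis of page {p}"

results = analyze_pages(range(500))  # returns instantly; nothing analyzed yet
first = next(results)                # only now is page 0 actually analyzed
print(first)
```

Pre-ingestion is the eager equivalent: `list(analyze_pages(range(500)))`, paying for all 500 pages up front whether or not a question ever touches them.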

Summary

Think of it like ordering food.

  • Pre-Ingestion (Old): The chef cooks a giant buffet of 500 dishes before you even sit down, hoping you like something. Most of it gets cold and wasted.
  • Deferred Visual Ingestion (New): The chef waits for you to order. You say, "I want the spicy shrimp." The chef then goes to the kitchen, looks at the fresh shrimp, and cooks just that dish perfectly.

This paper proves that for complex, visual documents like engineering drawings, waiting to look at the image until you have a question is the fastest, cheapest, and most accurate way to get answers.
