DISCO: Document Intelligence Suite for COmparative Evaluation

Imagine you have a massive, messy library filled with all kinds of documents: handwritten letters from the 1920s, colorful infographics about space, complex medical prescriptions, and multi-page financial reports. You want to ask a computer, "What does this say?" or "What is the total cost here?"

The paper "DISCO" is like a giant, rigorous taste test for two different types of "librarians" (AI systems) that try to read these documents for you. The authors want to figure out: Which librarian should you hire for which job?

Here is the breakdown of the two librarians and what the study found:

The Two Librarians

The "Scanner" (OCR Pipeline):
- How they work: This librarian is like a high-tech photocopier. First, they scan the page and turn every single letter into plain text (like a Word document). Then, they hand that text to a second librarian (a language model) to answer your question.
- Superpower: They are incredibly precise with handwriting and very long documents. They don't get confused by messy layouts because they just focus on the letters.
- Weakness: If the text is in a weird font, a different language, or part of a complex chart, they might miss the context or get the layout wrong. They also lose the "picture" of the document once they turn it into text.
The "Artist" (VLM - Vision-Language Model):
- How they work: This librarian looks at the whole picture at once. They don't just read the words; they see the colors, the charts, the handwriting style, and where things are placed on the page. They answer your question directly from the image.
- Superpower: They are amazing at understanding charts, colorful infographics, and documents with many different languages mixed together. They "get" the vibe of the document.
- Weakness: They can get overwhelmed by huge, multi-page documents (like a 100-page contract) and sometimes they might "hallucinate" (make up details) if the handwriting is too messy.

The Great Taste Test (The Results)

The researchers tested these librarians on a buffet of different documents. Here is what they discovered:

The Handwriting Challenge:
- Analogy: Imagine trying to read a doctor's messy scribble.
- Result: The Scanner wins here. It's trained specifically to decipher messy handwriting. The Artist gets confused and makes more mistakes, unless you give them very specific instructions (a "task-aware prompt").
The Multilingual & Chart Challenge:
- Analogy: Imagine a menu with Chinese, French, and English mixed together, with pictures of food.
- Result: The Artist wins here. They are used to seeing different scripts and visual layouts. The Scanner struggles with non-Latin scripts (such as Arabic or Chinese) and often breaks the connection between a chart and its label.
The "Long Book" Challenge:
- Analogy: Asking a question about a specific detail in a 50-page legal contract.
- Result: The Scanner wins again. When you have a huge document, the Artist gets lost in the noise. The Scanner breaks the document down into text, making it easier to find the needle in the haystack.
The "Single Page" Challenge:
- Analogy: A simple invoice or a postcard.
- Result: The Artist wins. Since the document is small and visual, looking at the whole picture is faster and more accurate than scanning it first and then reading the text.

The "Prompt" Surprise

The researchers also tried giving the librarians different instructions (prompts).

The Finding: Sometimes, giving the Artist specific instructions (like "Be careful with handwriting") helped. But other times, it actually made them worse at their job! It's like telling a chef, "Don't burn the toast," and they end up burning it because they were overthinking it. There is no "one size fits all" instruction.

The Big Takeaway

The paper concludes that there is no single "best" AI for all documents.

If you have handwritten notes, medical forms, or long contracts, use the Scanner (OCR). It's the reliable, methodical worker.
If you have colorful charts, infographics, or mixed-language documents, use the Artist (VLM). It's the creative, visual thinker.

DISCO is essentially a guidebook that tells businesses: "Don't just buy the most expensive AI and hope for the best. Look at your document first. If it's messy and long, hire the Scanner. If it's visual and colorful, hire the Artist."

This saves companies money and prevents errors by matching the right tool to the right job.

1. Problem Statement

Document intelligence faces a critical evaluation gap: current benchmarks typically report only end-to-end task accuracy (e.g., Question Answering scores). This makes it impossible to diagnose whether a system failure stems from:

Perception errors: The Optical Character Recognition (OCR) pipeline failed to extract the text correctly.
Representation errors: The text extraction lost spatial or layout context.
Reasoning errors: The Language Model (LLM) failed to understand the content even if the text was correct.
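
The three failure sources above can be told apart with a simple stage-wise check. The sketch below is only an illustration under stated assumptions: the `StageResult` fields, the `layout_dependent` flag, and the decision order are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    """Per-sample observations (hypothetical fields, for illustration only)."""
    gt_in_extracted_text: bool  # does the ground-truth answer appear in the parsed text?
    layout_dependent: bool      # assumed flag: the question relies on spatial layout
    answer_correct: bool        # did the final answer match the ground truth?


def attribute_failure(result: StageResult) -> str:
    """Assign a failed sample to one stage using a simple, assumed heuristic."""
    if result.answer_correct:
        return "none"
    if not result.gt_in_extracted_text:
        # The answer never made it into the text: the parser missed it.
        return "perception"
    if result.layout_dependent:
        # The text is present, but the layout context needed to use it was lost.
        return "representation"
    # The text is present and sufficient, yet the model still answered wrongly.
    return "reasoning"


print(attribute_failure(StageResult(False, False, False)))
```

A heuristic like this only makes sense when intermediate parsing output is recorded alongside the final answer, which is the gap the problem statement describes.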

Furthermore, practitioners lack empirical guidance on when to use OCR-first pipelines (extract text $\to$ LLM) versus end-to-end Vision-Language Models (VLMs) (image $\to$ answer), especially across diverse document types like handwritten notes, multilingual scripts, medical forms, and multi-page reports.

2. Methodology: The DISCO Framework

The authors introduce DISCO, a diagnostic evaluation suite that decouples Text Parsing from Question Answering (QA).

A. Benchmark Suite Composition

DISCO aggregates and subsamples (to <500 samples for feasibility) eight established datasets covering diverse document characteristics:

Parsing Tasks:
- IAM_DISCO: Handwritten text.
- ICDAR_DISCO: Multilingual scene text (10 languages).
- RxPad: French medical prescriptions.
- PubLayNet: Scientific document layouts.
Question Answering Tasks:
- DocVQA_DISCO: Scanned forms and letters.
- InfographicVQA_DISCO: Visual reports.
- DUDE_DISCO: Heterogeneous multi-page documents.
- ChartQAPro_DISCO: Charts and graphs.
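
Capping each dataset below 500 samples can be done reproducibly with a seeded random draw. This is a minimal sketch: the summary does not say how the subsets were selected, so the uniform sampling, the `seed` parameter, and the default cap of 499 are all assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(items: Sequence[T], max_samples: int = 499, seed: int = 0) -> List[T]:
    """Return at most max_samples items, drawn uniformly with a fixed seed."""
    if len(items) <= max_samples:
        # Small datasets are kept whole and in their original order.
        return list(items)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(list(items), max_samples)


subset = subsample(range(10_000))
print(len(subset))
```

Fixing the seed means every model is evaluated on the same subset, which keeps comparisons across pipelines fair.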

B. Experimental Protocol

The study compares three primary pipeline architectures across these datasets:

OCR-First (QA_OCR): Specialized OCR (e.g., Azure Document Intelligence, Mistral OCR) extracts text $\to$ LLM answers.
Two-Stage VLM (QA_VLM-2stage): VLM extracts text $\to$ VLM (same or different) answers.
Direct VQA (QA_VLM-direct): VLM answers the question directly from the image without intermediate text extraction.
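
The three architectures differ only in what is passed between stages, which a short sketch makes concrete. The callables below are hypothetical stand-ins for real OCR, LLM, and VLM services; none of the names or signatures come from the paper or from any vendor API.

```python
from typing import Callable

# Hypothetical interfaces (assumptions for illustration):
ParseFn = Callable[[bytes], str]             # image -> extracted text
TextAnswerFn = Callable[[str, str], str]     # (text, question) -> answer
ImageAnswerFn = Callable[[bytes, str], str]  # (image, question) -> answer


def qa_ocr(image: bytes, question: str, ocr: ParseFn, llm: TextAnswerFn) -> str:
    """OCR-first: a specialized OCR engine extracts text, then an LLM answers from it."""
    return llm(ocr(image), question)


def qa_vlm_2stage(image: bytes, question: str,
                  vlm_parse: ParseFn, vlm_answer: TextAnswerFn) -> str:
    """Two-stage VLM: a VLM extracts text, then a VLM answers from that text only."""
    return vlm_answer(vlm_parse(image), question)


def qa_vlm_direct(image: bytes, question: str, vlm: ImageAnswerFn) -> str:
    """Direct VQA: the VLM sees the image and the question together."""
    return vlm(image, question)


# Stub models so the sketch runs without any external service.
fake_parse = lambda image: "Total: 42 EUR"
fake_text_answer = lambda text, question: text.split(": ")[1]
fake_image_answer = lambda image, question: "42 EUR"

print(qa_ocr(b"...", "What is the total?", fake_parse, fake_text_answer))
```

In the first two functions the answering model never sees the image, so anything the parser drops (layout, color, chart geometry) is unavailable to it.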

Prompt Variations: The study tests three prompting strategies:

Generic: Standard extraction/answer prompts.
Chain-of-Thought (CoT): Step-by-step reasoning.
Task-Aware: Domain-specific instructions (e.g., "Preserve layout," "Extract medical fields").
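
The three strategies can be pictured as variations on one base prompt. The wording below is purely illustrative: the actual prompts used in the study are not given in this summary, and `build_prompt` and its parameters are hypothetical.

```python
def build_prompt(strategy: str, question: str, task_hint: str = "") -> str:
    """Assemble a prompt for one of the three strategies (illustrative wording only)."""
    base = f"Answer the question using the document.\nQuestion: {question}"
    if strategy == "generic":
        return base
    if strategy == "cot":
        # Chain-of-Thought: ask for intermediate reasoning before the answer.
        return base + "\nThink step by step, then give the final answer."
    if strategy == "task_aware":
        # Task-aware: prepend a domain-specific instruction such as "Preserve layout."
        if not task_hint:
            raise ValueError("task_aware prompts need a task_hint")
        return f"{task_hint}\n{base}"
    raise ValueError(f"unknown strategy: {strategy}")


print(build_prompt("task_aware", "Which drug is prescribed?", "Extract medical fields."))
```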

C. Evaluation Metrics

To capture different failure modes, the authors use a multi-metric approach:

Parsing: Character Error Rate (CER), Word Error Rate (WER), and cosine similarity of embeddings (S_CS).
QA:
- S_GT-in-Pred: Ground-truth substring presence (robust to verbosity).
- S_ANLS: Average Normalized Levenshtein Similarity (string matching).
- S_EM: Exact Match rate.
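
Most of these metrics reduce to edit distance and string comparison, so they can be sketched in a few lines. The version below uses common textbook definitions; the case folding, whitespace stripping, and the 0.5 ANLS threshold are assumptions, and the paper's exact normalization may differ. The embedding-based cosine similarity is omitted because it depends on an embedding model.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (characters or word tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def gt_in_pred(ground_truth: str, prediction: str) -> bool:
    """True if the ground truth appears anywhere in the prediction."""
    return ground_truth.strip().lower() in prediction.strip().lower()


def exact_match(ground_truth: str, prediction: str) -> bool:
    """True only if the two strings are equal after light normalization."""
    return ground_truth.strip().lower() == prediction.strip().lower()


def anls(ground_truth: str, prediction: str, threshold: float = 0.5) -> float:
    """Per-sample normalized Levenshtein similarity, zeroed past the threshold."""
    g, p = ground_truth.strip().lower(), prediction.strip().lower()
    if not g and not p:
        return 1.0
    distance = levenshtein(g, p) / max(len(g), len(p))
    return 1.0 - distance if distance < threshold else 0.0


verbose = "The total is 42 EUR."
print(gt_in_pred("42", verbose), exact_match("42", verbose), anls("42", verbose))
```

The last line shows why several metrics are reported together: a verbose but correct answer passes the substring check while scoring zero on exact match and on the thresholded similarity.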

3. Key Contributions

Diagnostic Framework: DISCO shifts evaluation from "which model wins" to "why it wins," isolating errors to perception, representation, or reasoning stages.
Comprehensive Benchmarking: It provides the first systematic comparison of OCR vs. VLMs across handwriting, multilingual text, medical forms, and multi-page documents.
Empirical Guidelines: The paper offers concrete recommendations for pipeline selection based on document structure and task complexity.
Model Regression Analysis: It reveals that newer model versions (e.g., Mistral OCR 3) do not always outperform predecessors, challenging the assumption that version increments guarantee improvement.

4. Key Results & Findings

A. Parsing Performance

Handwriting (IAM): Specialized OCR is generally superior. However, Task-Aware prompting allows VLMs to match or slightly exceed OCR performance (S_CER 0.080 vs 0.087).
Multilingual Text (ICDAR): VLMs significantly outperform OCR, especially on non-Latin scripts (Arabic, Chinese, etc.). For VLMs, task-aware prompting cut text-recognition error rates by ~87% relative to generic prompts.
Medical Prescriptions (RxPad): Both OCR and VLMs struggle (error rates around 0.65 S_CER). VLMs tend to output structured key-value pairs, while OCR outputs raw text, so string-based metrics end up comparing mismatched formats.

B. Question Answering Performance

Single-Page Visual Documents (DocVQA, InfographicVQA): Direct VQA performs best.
- Reasoning: Intermediate text extraction in two-stage pipelines causes information loss regarding spatial layout and visual cues.
- Metric Discrepancy: Direct VQA often achieves high S_GT-in-Pred (correct answer found) but low S_ANLS/S_EM due to verbose, unstructured outputs.
Multi-Page Documents (DUDE): OCR-First pipelines are superior.
- Reasoning: VLMs struggle with long contexts and retrieving specific information across pages. OCR provides more reliable text grounding.
- Error Propagation: In two-stage VLM pipelines, parsing errors compound, leading to significant QA performance drops (14% gap vs. OCR pipelines).

C. Prompting Effects

Task-Aware Prompts: Yield mixed results. They improve performance on multilingual and handwriting tasks but can degrade performance on others or introduce verbosity that hurts string-matching metrics.
Chain-of-Thought: Generally improves reasoning in OCR pipelines but increases latency.

D. Model Specific Observations

Mistral OCR Regression: Mistral OCR 3 (v2512) consistently underperformed Mistral OCR 2 (v2505) across all datasets, showing a ~23% drop in parsing effectiveness on DocVQA.
Azure vs. Mistral: Azure Document Intelligence showed superior layout analysis on single-page forms (DocVQA) but comparable performance to Mistral on multi-page documents.

5. Significance and Implications

Strategic Pipeline Selection: The paper concludes that there is no "one-size-fits-all" solution.
- Use Direct VLMs for single-page, visually rich documents (infographics, forms) where layout is critical.
- Use OCR-First Pipelines for long, multi-page, text-heavy documents where precise text grounding is required.
- Use Specialized OCR for handwritten text unless task-aware prompting is available.
Evaluation Best Practices: Relying solely on end-to-end accuracy is insufficient. Practitioners must evaluate intermediate parsing quality (S_GT-in-Extracted-Text) to understand system limits.
Cost-Efficiency Trade-offs: Direct VQA is the most cost-effective and fastest for single questions. Two-stage pipelines become cost-effective only when multiple questions are asked per document (amortizing the extraction cost).
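
The amortization argument can be made concrete with a small break-even calculation. The cost model is an assumption: it treats every image-based call as one fixed cost, extraction as a one-time cost per document, and every text-only call as a smaller fixed cost. The numbers in the usage line are made up for illustration and are not from the paper.

```python
import math
from typing import Optional


def direct_cost(n_questions: int, image_call_cost: float) -> float:
    """Direct VQA: the image is processed again for every question."""
    return n_questions * image_call_cost


def two_stage_cost(n_questions: int, extract_cost: float, text_call_cost: float) -> float:
    """Two-stage: pay for extraction once, then a text-only call per question."""
    return extract_cost + n_questions * text_call_cost


def break_even_questions(image_call_cost: float, extract_cost: float,
                         text_call_cost: float) -> Optional[int]:
    """Smallest question count at which the two-stage pipeline is strictly cheaper."""
    saving_per_question = image_call_cost - text_call_cost
    if saving_per_question <= 0:
        return None  # text calls are not cheaper, so extraction never pays off
    return math.floor(extract_cost / saving_per_question) + 1


# Illustrative units only: image call 10, extraction 12, text call 2.
print(break_even_questions(10, 12, 2))
```

With these made-up units, a single question favors Direct VQA (10 vs 14), while two questions already favor the two-stage pipeline (20 vs 16).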

In summary, DISCO provides a rigorous, stage-wise diagnostic tool that reveals the complementary nature of OCR and VLMs, guiding practitioners to select the optimal architecture based on specific document characteristics rather than model hype.
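
The selection guidance in Section 5 can be sketched as a small routing function. The `DocProfile` fields, the returned labels, and the order in which the rules are checked are assumptions made for illustration; the paper gives recommendations, not an algorithm.

```python
from dataclasses import dataclass


@dataclass
class DocProfile:
    """Coarse document description (hypothetical fields)."""
    num_pages: int
    visually_rich: bool
    handwritten: bool
    task_aware_prompt_available: bool = False


def choose_pipeline(doc: DocProfile) -> str:
    """Map a document profile to a pipeline label, following the stated guidance."""
    if doc.num_pages > 1:
        # Long, multi-page documents: precise text grounding matters most.
        return "ocr_first"
    if doc.handwritten and not doc.task_aware_prompt_available:
        # Handwriting without a tuned prompt: specialized OCR is the safer choice.
        return "specialized_ocr"
    # Single-page documents, including visually rich ones: answer from the image.
    return "vlm_direct"


print(choose_pipeline(DocProfile(num_pages=12, visually_rich=False, handwritten=False)))
```

Checking page count first is a design choice in this sketch: it assumes the multi-page finding outweighs the others when several conditions apply at once.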