ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering

Imagine you are trying to solve a very tricky puzzle, but the puzzle pieces are scattered across different rooms of a giant, messy library. Some pieces are handwritten notes, some are complex charts, some are printed tables, and some are just blurry photos.

If you ask a single, very smart person (like a standard AI model) to solve this, they might get overwhelmed. They might read the chart but miss the handwritten note next to it, or they might guess the answer without double-checking their work. They are "generalists"—good at many things, but not perfect at the specific, hard parts.

ORCA takes a different approach. Instead of hiring one super-person, ORCA hires a team of specialists and puts them in a room with a project manager. Here is how it works, using a simple analogy:

1. The Project Manager (The "Thinker" Agent)

First, the system doesn't just jump to the answer. It has a "Thinker" agent. Think of this person as the Project Manager.

What they do: They look at the messy document and the question. They don't try to solve it alone. Instead, they break the big question down into small, logical steps.
The Analogy: If the question is "What was the total revenue in Q3?", the Manager says: "Okay, first we need to find the table. Then we need to find the Q3 column. Then we need to read the numbers. Finally, we add them up."

2. The Specialist Team (The "Agent Dock")

Once the Manager has the plan, they call in the right experts from a "dock" of nine different specialists.

The Team (four of the nine):
- The OCR Expert: Reads messy handwriting or blurry text.
- The Table Expert: Understands rows and columns.
- The Chart Expert: Interprets graphs and diagrams.
- The Layout Expert: Knows where things are on the page.
The Analogy: Instead of asking the Project Manager to read both the table and the handwriting, ORCA calls the Table Expert for the numbers and the OCR Expert for the notes. They work one after another, each passing its findings down the line.

3. The "Stress Test" (The Debate)

This is where ORCA gets really clever. Most AIs just give you an answer and hope it's right. ORCA doesn't trust anyone immediately.

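The Process: If the "Manager" (Thinker) and the "Specialist Team" (Experts) give different answers, ORCA doesn't just pick one. It first cross-examines the Specialist with tough follow-up questions, and if the answer still looks shaky, it starts a Debate.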
The Analogy: Imagine a courtroom.
- The Thesis Agent (the Specialist) says: "I am 100% sure the answer is $500."
- The Antithesis Agent (a challenger) says: "Wait, look at this line here. I think it's actually $550."
- They argue back and forth for a few rounds, showing evidence from the document.
- A Judge listens to both sides. If the Specialist can't defend their answer against the challenger's questions, the Judge changes the answer. If the Specialist holds their ground, the answer is confirmed.

4. The Editor (The "Sanity Checker")

Finally, before the answer is sent to you, a final "Editor" checks it.

The Job: They make sure the answer looks exactly like the document. If the document writes numbers with commas (e.g., "1,000") and the AI wrote "1000", the Editor fixes it. If the document uses a specific date format, the Editor ensures the answer matches.
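
The Editor's job can be illustrated with a small sketch. This is not ORCA's actual Sanity Checker (which the summary does not specify in detail); the helper name `match_document_format` and the matching rule (ignore case, spacing, and punctuation) are assumptions made for illustration only.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and keep only letters and digits."""
    return re.sub(r"[^0-9a-z]", "", text.lower())


def match_document_format(answer: str, document_text: str) -> str:
    """Return the document's own spelling of `answer` when one exists.

    Hypothetical helper: scans short runs of whitespace-delimited tokens
    and returns the first run equal to the answer once case, spacing, and
    punctuation are ignored. If nothing matches, the answer is unchanged.
    """
    target = _normalize(answer)
    if not target:
        return answer
    tokens = document_text.split()
    # Allow the document to use a couple more tokens than the answer does.
    max_len = len(answer.split()) + 2
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            span = " ".join(tokens[start:end])
            if _normalize(span) == target:
                # Drop sentence punctuation that trails the matched span.
                return span.rstrip(".,;:")
    return answer


if __name__ == "__main__":
    print(match_document_format("1000", "Total revenue: 1,000 USD"))
```

With the sample text above, the answer "1000" is rewritten to the document's "1,000", while an answer that never appears in the document is returned untouched.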

Why is this better?

No "One-Size-Fits-All": It uses the right tool for the right job (a specialist for charts, a specialist for handwriting).
Double-Checking: It forces the AI to argue with itself to find mistakes before it gives you the final answer.
Transparency: You can see the "Manager's" plan and the "Debate" that happened, so you know how the answer was found, not just what the answer is.

In short: ORCA is like hiring a team of experts with a smart manager and a strict editor, rather than relying on one person to do everything. On the benchmarks reported below, this makes it more accurate than single-model baselines at solving complex, messy document puzzles.

1. Problem Statement

Document Visual Question Answering (DocVQA) involves answering questions based on single-page document images containing diverse modalities (text, tables, figures, handwritten content, forms). While Vision-Language Models (VLMs) have advanced, they face critical limitations in complex scenarios:

Lack of Decomposition: Current models struggle to break down intricate, multi-step questions into manageable sub-tasks.
One-Size-Fits-All Approach: Single models often fail to apply specialized processing (e.g., OCR for handwriting vs. layout analysis for tables) to different document elements simultaneously.
Absence of Verification: Standard VLMs generate answers directly without self-verification, debate, or iterative refinement, leading to hallucinations or unwarranted confidence in complex reasoning tasks.
Inflexibility: Existing Chain-of-Thought (CoT) methods still rely on a single model for all reasoning steps, lacking mechanisms for content-aware specialization or adaptive agent selection.

2. Methodology: The ORCA Framework

ORCA is a multi-agent framework designed to orchestrate explicit reasoning and collaborative execution through a five-stage pipeline. It moves away from monolithic models to a modular system where specialized agents handle specific document components.

Core Architecture Components

Thinker Agent ( $A_{think}$ ): Based on GLM-4.5V-9B. It analyzes the document and question to generate a structured Reasoning Path ( $R$ ) (decomposing the query into logical steps) and an Initial Answer ( $a_T$ ).
Agent Dock: A repository of nine specialized agents, each fine-tuned on Qwen3-VL-8B for specific modalities:
- $A_{figure}$ (Diagrams/Charts), $A_{table}$ (Tables/Lists), $A_{ocr}$ (Handwriting/Difficult text), $A_{form}$ (Forms), $A_{layout}$ (Structure), $A_{image}$ (Photos), $A_{text}$ (Free text), $A_{yesno}$ (Binary questions), $A_{other}$.
Router Agent ( $A_{route}$ ): A Qwen2.5-VL-7B model trained via constrained generation (Turbo DFS decoding) to predict a binary activation vector. It determines which subset of the 9 specialized agents is required based on the reasoning path $R$ , question $q$ , and document $D$ .
Orchestrator: Runs the activated agents ( $A_{active}$ ) one after another, passing the output of each agent as input to the next. Crucially, it masks the thinker's initial answer ( $a_T$ ) in the reasoning path fed to the final expert agent to prevent confirmation bias.
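
To make the routing and orchestration steps concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the order of bits in the activation vector (taken from the order the agents are listed above), the `[MASKED]` placeholder, masking by plain string replacement, and the agent call signature are all hypothetical.

```python
from typing import Callable, Dict, List

# Assumed bit order: the order in which the Agent Dock is listed in the text.
AGENT_NAMES = ["figure", "table", "ocr", "form", "layout",
               "image", "text", "yesno", "other"]

# Hypothetical agent signature: (question, reasoning_path, previous_output) -> output
Agent = Callable[[str, str, str], str]


def select_agents(activation: List[int]) -> List[str]:
    """Turn the router's binary activation vector into agent names."""
    if len(activation) != len(AGENT_NAMES):
        raise ValueError("activation vector must have one bit per agent")
    return [name for name, bit in zip(AGENT_NAMES, activation) if bit == 1]


def mask_initial_answer(reasoning_path: str, initial_answer: str,
                        mask: str = "[MASKED]") -> str:
    """Hide the Thinker's initial answer a_T inside the reasoning path R."""
    if not initial_answer:
        return reasoning_path
    return reasoning_path.replace(initial_answer, mask)


def orchestrate(active: List[str], agents: Dict[str, Agent], question: str,
                reasoning_path: str, initial_answer: str) -> str:
    """Run the active agents in order, chaining each output into the next.

    Only the final agent receives the masked reasoning path, as described
    in the text. Returns the last agent's output (the expert answer a_E).
    """
    output = ""
    for index, name in enumerate(active):
        is_final = index == len(active) - 1
        path = (mask_initial_answer(reasoning_path, initial_answer)
                if is_final else reasoning_path)
        output = agents[name](question, path, output)
    return output
```

For example, `select_agents([0, 1, 1, 0, 0, 0, 0, 0, 0])` yields the table and OCR agents; in `orchestrate`, the table agent would see the full reasoning path while the OCR agent, being last, would see it with $a_T$ masked.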

The Five-Stage Pipeline

Stage 1 (Context Understanding): The Thinker generates the reasoning path $R$ and initial hypothesis $a_T$ .
Stage 2 (Collaborative Agent Execution):
- The Router selects relevant specialized agents.
- The Orchestrator executes them sequentially.
- The final agent produces the Expert Answer ( $a_E$ ).
Stage 3 (Stress Testing Session):
- If $a_E = a_T$ , Stages 3 and 4 are skipped and the answer goes straight to Stage 5.
- If $a_E \neq a_T$ , a Debate Agent generates challenging follow-up questions ( $q_{debate}$ ) to probe the Expert's confidence.
- The Expert answers $q_{debate}$ and potentially revises $a_E$ .
- An Evaluation Agent judges if the response is coherent and consistent. If the Expert fails, the process moves to Stage 4.
Stage 4 (Multi-turn Conversation, i.e. Debate):
- Triggered only when the Expert fails the stress test, so that uncertainty remains (approx. 8.3% of cases).
- Involves a Thesis Agent (defending $a_E$ ) and an Antithesis Agent (generating an alternative $a_{alt}$ ).
- They engage in a structured 3-turn debate with evidence, criticism, and conclusions.
- A Judge Agent evaluates the debate to determine the final consensus or select the most confident answer ( $a_C$ ).
Stage 5 (Answer Refinement): A Sanity Checker ensures the final answer ( $a_F$ ) matches the document's formatting conventions (e.g., spacing, punctuation) to satisfy evaluation metrics.
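
The control flow of the five stages, including the early exit when the Thinker and the Expert agree, can be sketched as follows. This is a simplified illustration, not the authors' code: every stage is passed in as a plain callable, the names (`Stages`, `answer_question`, the `lite` flag) are hypothetical, and comparing $a_E$ with $a_T$ by exact string equality is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stages:
    """Hypothetical bundle of stage callables (stand-ins for the real agents)."""
    think: Callable[[str, str], Tuple[str, str]]              # (doc, q) -> (R, a_T)
    run_experts: Callable[[str, str, str], str]               # (doc, q, R) -> a_E
    stress_test: Callable[[str, str, str], Tuple[str, bool]]  # -> (revised a_E, passed)
    debate: Callable[[str, str, str], str]                    # (doc, q, a_E) -> a_C
    refine: Callable[[str, str], str]                         # (doc, answer) -> a_F


def answer_question(doc: str, question: str, stages: Stages,
                    lite: bool = False) -> Tuple[str, List[str]]:
    """Run the pipeline and return (final answer, list of stages executed)."""
    trace = ["context_understanding"]
    reasoning_path, a_t = stages.think(doc, question)

    trace.append("collaborative_execution")
    a_e = stages.run_experts(doc, question, reasoning_path)

    answer = a_e
    # Verification runs only on disagreement; "lite" keeps Stages 1, 2, 5 only.
    if not lite and a_e != a_t:
        trace.append("stress_testing")
        answer, passed = stages.stress_test(doc, question, a_e)
        if not passed:
            trace.append("debate")
            answer = stages.debate(doc, question, answer)

    trace.append("answer_refinement")
    return stages.refine(doc, answer), trace
```

Returning the trace alongside the answer is a design choice for this sketch: it makes it easy to see which questions took the short path and which escalated to the debate.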

3. Key Contributions

Novel Multi-Agent Architecture: Introduces a framework that integrates explicit reasoning decomposition with specialized agent collaboration, moving beyond single-model limitations.
Adaptive Routing & Specialization: Utilizes a learned router to dynamically select and sequence specialized agents based on document content, ensuring the right tool is used for tables, handwriting, or figures.
Adversarial Verification Mechanism: Implements a conditional debate system (Stress Testing + Multi-turn Conversation) that activates only when necessary, providing robust self-verification and conflict resolution without excessive computational overhead.
Bias Mitigation: Introduces a reasoning path masking mechanism to prevent downstream agents from anchoring on the Thinker's initial hypothesis, encouraging independent analysis.
State-of-the-Art Performance: Achieves top results on three benchmarks (Single-Page DocVQA, InfographicsVQA, and OCRBench-v2), suggesting that collaborative orchestration can outperform monolithic scaling.

4. Experimental Results

The framework was evaluated on three benchmarks: Single-Page DocVQA, InfographicsVQA, and OCRBench-v2.

DocVQA & InfographicsVQA:
- ORCA (Qwen3-VL-8B) achieved 97.2% on DocVQA and 88.0% on InfographicsVQA.
- This represents a +1.1% gain on DocVQA and a larger +4.9% gain on InfographicsVQA over the best single-model baseline (Qwen3-VL-8B-Instruct).
- The improvement is most pronounced in complex reasoning tasks (e.g., InfographicsVQA), validating the need for specialized collaboration.
OCRBench-v2:
- ORCA achieved an average score of 67.1% (with Qwen3-VL-8B), outperforming baselines by +1.7%.
- Notably, the framework provided greater relative gains for smaller models (e.g., +3.6% for Qwen2.5-VL-7B), suggesting it effectively compensates for limited model capacity through specialization.
Efficiency & Latency:
- Through early termination (skipping debate stages when $a_E = a_T$ , occurring in 77% of cases), ORCA maintains a favorable accuracy-latency trade-off.
- Full pipeline execution takes ~9.6–13.1s, but the "Lite" version (Stages 1, 2, 5) offers a 4–6x speedup with minimal accuracy loss on complex tasks.
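
The figures above imply a few derived numbers, worked out in the short sketch below. These are back-of-the-envelope values, not results reported in the source, and they rest on explicit assumptions: the gains are absolute percentage points, any full-pipeline latency may pair with any speedup in the stated range, and the 77% and 8.3% shares are both fractions of all questions. The function names are hypothetical.

```python
from typing import Tuple


def implied_baseline(score: float, gain: float) -> float:
    """Baseline score implied by a reported score and an absolute gain."""
    return round(score - gain, 1)


def lite_latency_bounds(full_low_s: float, full_high_s: float,
                        speedup_low: float, speedup_high: float) -> Tuple[float, float]:
    """Loose bounds (seconds) on Lite latency from full latency and speedup ranges."""
    return (round(full_low_s / speedup_high, 1),
            round(full_high_s / speedup_low, 1))


def stress_test_only_share(agree_pct: float, debate_pct: float) -> float:
    """Share of questions that enter stress testing but never reach the debate."""
    return round(100.0 - agree_pct - debate_pct, 1)


if __name__ == "__main__":
    print(implied_baseline(97.2, 1.1))             # DocVQA baseline -> 96.1
    print(implied_baseline(88.0, 4.9))             # InfographicsVQA baseline -> 83.1
    print(implied_baseline(67.1, 1.7))             # OCRBench-v2 baseline -> 65.4
    print(lite_latency_bounds(9.6, 13.1, 4, 6))    # -> (1.6, 3.3)
    print(stress_test_only_share(77.0, 8.3))       # -> 14.7
```

Under these assumptions, the Lite variant would land somewhere between roughly 1.6 s and 3.3 s per question, and about 14.7% of questions would be settled by stress testing alone.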

5. Significance and Impact

Paradigm Shift: ORCA argues for a shift in DocVQA from "bigger models" to "smarter orchestration." The reported results indicate that modular, collaborative systems can outperform monolithic models of similar or even larger parameter counts.
Reliability & Explainability: The framework provides transparent reasoning paths and a verifiable debate process, making it suitable for high-stakes document analysis where trust and auditability are crucial.
Scalability: The modular design allows for independent upgrades of specific components (e.g., swapping the router or a specific specialist agent) as foundation models evolve, without retraining the entire system.
Future Directions: The authors plan to optimize the router via Reinforcement Learning, learn orchestration ordering via policy gradients, and extend the framework to multi-page document understanding.

In conclusion, ORCA demonstrates that orchestrated reasoning with specialized agents and adversarial verification is a highly effective strategy for solving complex, multi-modal document understanding problems that currently challenge state-of-the-art VLMs.
