AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation

Imagine you are a detective trying to solve a mystery, but instead of a few clues, you are handed a stack of 500 encyclopedias. Your boss asks, "How much did the company's assets grow last year?"

The Old Way (Traditional RAG):
In the past, to answer this, you would photocopy every single page of those 500 encyclopedias and shove them all into a giant box to give to your assistant (the AI).

The Problem: Your assistant has to read through thousands of pages of irrelevant junk—advertisements, footers, blank pages, and chapters about "Company History" from 1990—just to find the one table with the numbers you need.
The Result: The assistant gets overwhelmed, misses the important numbers because they are buried in the noise, and might even start making things up (hallucinating) because it's trying too hard to find a pattern in the chaos. It's like searching for a needle by handing someone the whole haystack and hoping they spot it.

The New Way (AgenticOCR):
This paper introduces AgenticOCR, which acts like a super-smart, proactive research assistant who doesn't just wait for instructions but thinks before acting.

Here is how it works, using a simple analogy:

1. The "Thinking" Detective

Instead of blindly photocopying everything, AgenticOCR looks at the document and asks: "Where is the answer likely to be?"

If the question is about a table, it zooms in specifically on that table.
If the text is sideways (rotated), it mentally rotates the page to read it.
If the font is tiny, it zooms in like a magnifying glass.

It only "decompresses" (reads and processes) the tiny, specific parts of the document that actually matter. It ignores the rest.

2. The "On-Demand" Library

Think of the document as a massive library.

Old System: You take the entire library, lock it in a room, and tell the AI to read it. The AI drowns in books it doesn't need.
AgenticOCR: You tell the AI, "I need the 1995 financial report." The AI walks to the shelf, pulls out only that book, opens it to page 42, highlights the specific paragraph, and hands you just that piece of paper. It leaves the rest of the library untouched.

3. Why This Matters (The "Token" Budget)

AI models can only take in a limited amount of input at once (called a token budget). Imagine your AI has a backpack that can only hold 10 items.

Old Way: You stuff the backpack with 10 whole encyclopedias. There's no room left for the actual answer, and the AI gets confused.
AgenticOCR: You put only the 3 specific pages with the answer in the backpack. Now the AI has plenty of room to think clearly, analyze the data, and give you an accurate answer without getting confused.

The Big Picture

The paper calls this the "Third Building Block" of visual document AI.

Block 1: Finding the right document (Retrieval).
Block 2: Ranking the best pages (Reranking).
Block 3 (AgenticOCR): Reading the document intelligently.

In short: AgenticOCR changes AI from a passive machine that reads everything you give it into an active agent that knows what to look for, how to look at it, and reads only what is necessary. This makes the AI faster, cheaper (less computing power needed), and much more accurate, especially for complex documents like financial reports or technical manuals.

1. Problem Statement

The paper addresses a critical bottleneck in Visual Retrieval-Augmented Generation (Visual RAG), particularly when processing complex, information-dense documents like financial reports, technical manuals, and academic papers.

The Granularity Mismatch: Current Visual RAG pipelines typically retrieve and process documents at the page level. This forces the generator (LLM/VLM) to ingest entire pages containing headers, footers, decorative elements, and irrelevant sections.
Consequences:
1. Attention Dilution: The generator's attention mechanism is overwhelmed by extraneous visual context, reducing its ability to focus on query-relevant evidence.
2. Token Inefficiency: High-resolution pages must be compressed into limited visual token budgets, often sacrificing fine-grained details (e.g., small fonts, rotated tables, complex formulas).
3. Hallucination Risk: The combination of irrelevant context and compressed visual data increases the likelihood of model hallucinations.
The Limitation of Current OCR: While traditional full-document OCR has reached high accuracy (90–95%), it remains a static, "parse-everything" process. It lacks the ability to dynamically adapt to specific user queries, leading to unnecessary data processing.

2. Methodology: AgenticOCR

The authors propose AgenticOCR, a paradigm shift from static pre-processing to a dynamic, query-driven, agentic process. Instead of parsing the whole document upfront, AgenticOCR acts as an intelligent middleware that performs "on-demand decompression" of visual information.

Core Components

The image_zoom_and_ocr_tool Primitive:
- A unified tool that allows the agent to interact with the document image.
- Parameters: It accepts a bounding box (bbox), a rotation angle (θ), and a semantic type (τ ∈ {region, text, table, image, equation}).
- Modes:
  - Region Mode: Performs layout analysis + fine-grained recognition on complex areas.
  - Element Mode: Directly applies OCR to specific elements (text, tables) for efficiency.
  - Image Mode: Returns cropped visual patches without OCR for pure visual perception.
- Function: The model autonomously decides where to look, how to orient (rotate), and what granularity to parse, mimicking human visual attention.
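To make the tool's interface concrete, here is a minimal sketch of how one call's arguments might be represented and routed to a mode. Everything in it is an assumption for illustration: the names `ZoomRequest`, `select_mode`, and `to_pixel_box` are hypothetical, and the choice of normalized [0, 1] coordinates is a guess, not something the paper summary specifies.

```python
from dataclasses import dataclass
from typing import Tuple

# The five semantic types (tau) listed in the tool description.
SEMANTIC_TYPES = {"region", "text", "table", "image", "equation"}


@dataclass(frozen=True)
class ZoomRequest:
    """Hypothetical argument schema for one tool call (illustrative names)."""
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1); ASSUMED normalized to [0, 1]
    rotation: float = 0.0                    # theta, in degrees
    semantic_type: str = "region"            # tau

    def __post_init__(self):
        x0, y0, x1, y1 = self.bbox
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid bbox: {self.bbox}")
        if self.semantic_type not in SEMANTIC_TYPES:
            raise ValueError(f"unknown semantic type: {self.semantic_type}")


def select_mode(req: ZoomRequest) -> str:
    """Map the semantic type onto the three modes described above."""
    if req.semantic_type == "region":
        return "region"    # layout analysis + fine-grained recognition
    if req.semantic_type == "image":
        return "image"     # cropped patch only, no OCR
    return "element"       # direct OCR on text, table, or equation


def to_pixel_box(req: ZoomRequest, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert the normalized bbox into pixel coordinates for cropping."""
    x0, y0, x1, y1 = req.bbox
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))
```

A request such as `ZoomRequest(bbox=(0.1, 0.2, 0.5, 0.6), rotation=90, semantic_type="table")` would be routed to element mode in this sketch; treating equations as element-mode targets is likewise an assumption.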
Two-Stage Training Pipeline:
- Stage 1: Supervised Fine-Tuning (SFT) via Trajectory Distillation:
  - Cold Start: Uses rejection sampling on Gemini-3-Pro-Preview to distill high-quality reasoning trajectories from the ViDoRe-v3 benchmark.
  - Filtering: Employs a dual-threshold strategy (using $IoU_{EM}$ and $IoU_{min}$) to filter for high-precision bounding box predictions and constructs negative samples (irrelevant pages) to teach the model to suppress false positives.
  - Goal: Instills a stable prior for when and how to invoke the tool.
- Stage 2: Alignment via Reinforcement Learning (RL):
  - Algorithm: Uses Group Relative Policy Optimization (GRPO).
  - Curriculum: Focuses on "ambiguous cases" where the SFT model is unstable.
  - Reward Design: Optimizes for coverage ($Recall_{min}$, $Recall_{EM}$) while penalizing:
    - Spurious Predictions: Hallucinated boxes.
    - Redundant Overlap: Duplicate evidence.
    - Lazy Full-Page Parsing: Invoking "region" mode on >85% of a page (the penalty forces genuine spatial reasoning).
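The reward shape described above can be sketched roughly as follows. This is not the paper's reward: a plain IoU-matched recall stands in for $Recall_{min}$ and $Recall_{EM}$ (whose exact definitions are not given here), and the weights, the 0.5 matching threshold, and all function names are hypothetical. Only the 85% full-page cutoff comes from the text. The last function shows the group-relative normalization commonly used in GRPO, where each sampled trajectory's reward is standardized against its own group.

```python
from statistics import mean, pstdev


def area(b):
    """Area of a box (x0, y0, x1, y1); degenerate boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def sketch_reward(pred, modes, gold, page=(0.0, 0.0, 1.0, 1.0),
                  match_iou=0.5, w_spurious=0.2, w_overlap=0.2,
                  w_lazy=0.5, lazy_frac=0.85):
    """Coverage minus penalties; weights and match_iou are ASSUMED values."""
    # Coverage: fraction of gold boxes matched by at least one prediction.
    covered = sum(any(iou(p, g) >= match_iou for p in pred) for g in gold)
    recall = covered / len(gold) if gold else 0.0
    # Spurious predictions: boxes that match no gold evidence.
    spurious = sum(all(iou(p, g) < match_iou for g in gold) for p in pred)
    # Redundant overlap: pairs of predictions covering the same area.
    overlap = sum(iou(pred[i], pred[j]) >= match_iou
                  for i in range(len(pred)) for j in range(i + 1, len(pred)))
    # Lazy full-page parsing: "region" mode on more than 85% of the page.
    lazy = sum(m == "region" and area(p) / area(page) > lazy_frac
               for p, m in zip(pred, modes))
    return recall - w_spurious * spurious - w_overlap * overlap - w_lazy * lazy


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its sampling group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under these assumed weights, a tight box on the gold evidence scores higher than a single full-page "region" call, so the full-page trajectory receives a negative advantage within its group.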
Integration Protocol:
- AgenticOCR operates as a plug-and-play module between the Retriever and the Generator.
- It processes retrieved pages independently, extracting structured evidence (cropped images + OCR text) and feeding this compact, high-signal data to the generator (e.g., Gemini, Qwen).
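The integration described above amounts to a thin layer between two existing components. The sketch below shows one way that wiring might look; `retrieve`, `extract`, and `generate` are hypothetical callables standing in for the retriever, the AgenticOCR agent, and the generator, and the `Evidence` record is an assumed shape for "cropped image + OCR text", not the released interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Evidence:
    """Assumed shape of one extracted evidence item."""
    page_id: str
    bbox: Tuple[float, float, float, float]
    ocr_text: Optional[str]  # None when only a cropped image is returned


def answer_with_agentic_ocr(
    query: str,
    retrieve: Callable[[str, int], List[str]],        # query, top_k -> page ids
    extract: Callable[[str, str], List[Evidence]],    # query, page id -> evidence
    generate: Callable[[str, List[Evidence]], str],   # query, evidence -> answer
    top_k: int = 5,
) -> str:
    """Retriever -> per-page evidence extraction -> generator."""
    pages = retrieve(query, top_k)
    evidence: List[Evidence] = []
    for page_id in pages:
        # Each retrieved page is processed independently; an irrelevant
        # page may simply contribute no evidence.
        evidence.extend(extract(query, page_id))
    # The generator sees only the compact evidence, not the full pages.
    return generate(query, evidence)
```

Passing the three stages in as callables keeps the middle step swappable, which matches the plug-and-play framing: the retriever and generator do not need to know how the evidence was produced.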

3. Key Contributions

Conceptual Formalization: Introduces AgenticOCR as a potential "third building block" in the Visual RAG stack, sitting alongside Embedding and Reranking modules. It shifts the paradigm from "parsing everything" to "parsing only what you need."
Model Realization: Develops high-performing models (AgenticOCR-4B and AgenticOCR-8B) based on Qwen3-VL. The training pipeline combines trajectory distillation from a stronger teacher model (Gemini) with GRPO-based alignment. Both models and datasets are open-sourced.
Empirical Validation: Demonstrates that AgenticOCR significantly improves the signal-to-token ratio in Visual RAG, leading to state-of-the-art (SOTA) performance in long-document understanding tasks.

4. Experimental Results

The authors evaluated their system on MMLongBench-Doc (long, multi-domain documents) and FinRAGBench-V (financial reports with dense layouts).

Performance vs. Human Experts:
- On MMLongBench-Doc, the AgenticOCR-8B model with "Evidence+OCR" input achieved 66.4% accuracy, surpassing the human expert baseline of 65.8%.
- It outperformed existing agentic frameworks (e.g., DocLens, MDocAgent) and standard VLMs augmented with full-page OCR.
Modalities:
- Showed exceptional strength in Text (TXT), Layout (LAY), and Figure (FIG) reasoning, where precise localization and layout preservation are critical.
- Limitations: Slightly lagged on Table (TAB) extraction (due to incomplete context in cropped regions) and Unanswerable (UNA) questions (due to lower retrieval precision compared to specialized multi-agent systems like DocLens).
Efficiency:
- While token consumption varied depending on the generator's image token allocation policy (e.g., Gemini's fixed per-image cost), AgenticOCR consistently improved accuracy.
- When using generators with flexible token budgets (e.g., Qwen3-VL), AgenticOCR significantly reduced total input tokens compared to full-page baselines while maintaining or improving accuracy.

5. Significance and Future Work

Paradigm Shift: AgenticOCR redefines OCR not as a static preprocessing step but as an active, reasoning-driven perception process. It enables VLMs to "think with images" by dynamically manipulating the visual input.
Efficiency: By decoupling retrieval granularity from rigid page-level chunking, it addresses the "noise vs. signal" problem in Visual RAG, allowing smaller models to reach expert-level performance on the reported benchmarks.
Future Directions: The authors suggest improving retrieval precision (to reduce irrelevant pages passed to the agent), enhancing table extraction context, and exploring tighter integrations between visual agents and generative LLMs.

In conclusion, AgenticOCR represents a significant step forward in making Visual RAG systems more efficient, accurate, and capable of handling the complexities of real-world visual documents.
