FireRed-OCR: Turning a "Generalist" into a "Document Surgeon"
Imagine you have a brilliant, well-read friend who can look at a picture of a painting and tell you a beautiful story about the colors and the artist's intent. This is what Large Vision-Language Models (VLMs) are like today. They are smart, creative, and great at general tasks.
But if you ask this friend to read a complex financial report, a handwritten math exam, or a newspaper with columns running in different directions, they start making things up. They might draw a table that doesn't exist, mix up the order of the paragraphs, or write a math formula that looks right but is actually nonsense. In the tech world, we call this "Structural Hallucination." It's like a chef who can cook a great steak but keeps forgetting to salt the fries: the impressive skill is there, but the basic rules get dropped.
FireRed-OCR is a new framework created by the team at Xiaohongshu (a popular Chinese social media app) to fix this. They took a general "smart" model (based on Qwen3-VL) and trained it to become a pixel-perfect document surgeon. Here is how they did it, explained simply.
1. The Problem: The "Generalist" vs. The "Specialist"
Think of a general VLM as a Jack-of-all-trades. It can do a little bit of everything, but when it comes to the strict rules of document formatting (like making sure a table has the right number of columns or a math equation is perfectly balanced), it gets sloppy.
In the real world, if a bank's software reads a check wrong because the model "hallucinated" a zero, that's a disaster. They need a specialist who follows the rules strictly.
2. The Solution: The "Geometry + Semantics" Data Factory
To train this specialist, you can't just throw random documents at it. If you feed it 1,000 simple novels and only 1 complex tax form, it will only learn how to read novels.
The team built a "Data Factory" with two special machines:
- The Geometry Scanner: Instead of reading the words, this machine looks at the shape of the page. Is it a single column? Is it a messy table? Is it a form with boxes? It groups documents by how they look, not just what they say. This ensures the model sees plenty of weird, difficult layouts (the "long-tail" problems).
- The Semantic Tagger: This machine labels the content (e.g., "Math," "Legal," "Handwriting").
By mixing these two, they created a perfectly balanced diet of training data. They didn't just sample randomly; they curated the data to ensure the model practiced on the hardest puzzles first.
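This two-axis curation can be pictured as stratified sampling over (layout, topic) buckets. Below is a minimal Python sketch under an assumed document schema; the `layout` and `topic` fields and the equal-quota rule are illustrative stand-ins, not the paper's actual pipeline:

```python
import random
from collections import defaultdict

def balanced_sample(docs, n_total, seed=0):
    """Stratified sampling over (layout, topic) buckets so rare layouts
    (multi-column tables, forms) are not drowned out by easy single-column
    pages. The schema and quota rule are illustrative assumptions."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for d in docs:
        # Group by geometry (layout cluster) AND semantics (content tag).
        buckets[(d["layout"], d["topic"])].append(d)
    per_bucket = max(1, n_total // len(buckets))  # equal quota per bucket
    sample = []
    for docs_in_bucket in buckets.values():
        k = min(per_bucket, len(docs_in_bucket))
        sample.extend(rng.sample(docs_in_bucket, k))
    return sample
```

With 1,000 novels and 2 tax forms, random sampling would almost never pick a tax form; bucketed sampling guarantees the rare layout shows up in every batch.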
3. The Training: A Three-Step "Boot Camp"
The team didn't just dump data on the model. They used a Three-Stage Progressive Training strategy, like a martial arts master teaching a student.
Stage 1: The "Eyes and Hands" Drill (Pre-alignment)
Before the student can write an essay, they must learn to point at things.
- What happens: The model is trained to point to specific words on a page and say what they are (Detection & OCR).
- The Analogy: Imagine a child learning to read. First, they learn to point at a word and say "Cat." They aren't writing a story yet; they are just learning to connect the shape of the letters to the sound. This grounds the model in reality so it stops guessing where things are.
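A grounding example from this stage pairs each word with its location on the page. A hypothetical sketch of how such a "point and name" supervision string could be built; the `<box>` template is an assumed format, not necessarily the one the team used:

```python
def grounding_target(boxes):
    """Turn detected words into a 'point and name' supervision string.
    Each entry pairs text with its bounding box so the model learns
    *where* words are, not just what they say. The exact target
    template is an assumption; the paper's may differ."""
    lines = []
    for text, (x1, y1, x2, y2) in boxes:
        lines.append(f"<box>({x1},{y1}),({x2},{y2})</box> {text}")
    return "\n".join(lines)
```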
Stage 2: The "Strict Editor" Drill (Specialized SFT)
Now that the model can point at words, it needs to learn the rules of grammar and formatting.
- What happens: The model is shown high-quality documents and taught to rewrite them perfectly in Markdown (a simple markup language for formatting text).
- The Analogy: The model is now a copy editor. It learns that if it sees a header, it must mark it with a # symbol. If it sees a table, it must use pipe characters (|). It learns that a table must close properly, or the whole document breaks. It stops being creative and starts being precise.
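The formatting rules this "copy editor" learns can be made concrete with a toy renderer. A minimal sketch, assuming an invented block schema (`header`, `table`), that emits the kind of strict Markdown the model is trained to produce:

```python
def to_markdown(blocks):
    """Render parsed document blocks as strict Markdown: '#' for headers,
    pipe tables with a separator row. The block schema is illustrative,
    not the model's actual output format."""
    out = []
    for b in blocks:
        if b["type"] == "header":
            out.append("#" * b["level"] + " " + b["text"])
        elif b["type"] == "table":
            rows = b["rows"]
            out.append("| " + " | ".join(rows[0]) + " |")
            out.append("|" + "---|" * len(rows[0]))  # required separator row
            for row in rows[1:]:
                out.append("| " + " | ".join(row) + " |")
    return "\n".join(out)
```

The point of Stage 2 is that every one of these rules (the separator row, the matching pipe counts) is non-negotiable: one missing pipe and a Markdown viewer renders the whole table as plain text.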
Stage 3: The "Referee" Drill (Format-Constrained GRPO)
This is the secret sauce. Even smart models sometimes cheat or get lazy.
- What happens: They use a technique called GRPO (Group Relative Policy Optimization). Imagine the model generates 5 different versions of a document. A "Referee" (a set of strict rules) checks them:
  - Did the math formula compile? (If no, -10 points).
  - Did the table have the same number of columns in every row? (If no, -10 points).
  - Did all the brackets close? (If no, -10 points).
- The Analogy: It's like a video game where the model gets a high score only if it follows the rules perfectly. If it tries to "hack" the system by making up a fake table, the referee catches it immediately. The model learns that structure is just as important as content.
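The "referee" plus the group comparison can be sketched as a rule-based reward and a group-relative advantage. The specific checks and the -10 penalty mirror the description above, but this is a simplified illustration, not the paper's actual reward:

```python
def format_reward(md):
    """Rule-based 'referee' score for one candidate transcription.
    Each check earns +1 if passed, -10 if failed (illustrative values)."""
    score = 0.0
    # 1. Every table row must have the same number of columns (pipes).
    rows = [ln for ln in md.splitlines() if ln.strip().startswith("|")]
    if rows:
        score += 1.0 if len({ln.count("|") for ln in rows}) == 1 else -10.0
    # 2. Inline math delimiters ($...$) must be paired.
    score += 1.0 if md.count("$") % 2 == 0 else -10.0
    # 3. Brackets must balance.
    balance = 0
    for ch in md:
        balance += {"(": 1, ")": -1}.get(ch, 0)
        if balance < 0:
            break
    score += 1.0 if balance == 0 else -10.0
    return score

def group_relative_advantages(scores):
    """GRPO's core idea: score a *group* of candidate outputs, then each
    candidate's advantage is its score minus the group mean."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]
```

Candidates that obey the format sit above the group mean and get reinforced; a candidate with a "hacked" table sits far below it and gets pushed away, which is exactly the "structure matters as much as content" signal described above.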
4. The Results: Beating the Giants
The team tested their new model, FireRed-OCR, on a tough benchmark called OmniDocBench.
- The Surprise: FireRed-OCR is a relatively small model (only 2 billion parameters). Compare that to giants like Qwen3-VL-235B (235 billion parameters) or Gemini, which are massive.
- The Outcome: FireRed-OCR won. It scored 92.94%, beating the massive general models and even the specialized "pipeline" systems that use multiple different tools to do the job.
- Why it matters: It proves you don't need a super-computer-sized brain to read documents perfectly. You just need the right training data and the right "boot camp" strategy.
Summary: The "FireRed" Magic
Think of FireRed-OCR as taking a talented but messy artist and turning them into a precision architect.
- Geometry Factory: They gave the architect a library of every possible building blueprint, not just the easy ones.
- Three-Stage Training: They taught the architect to measure first, then draft, then finally, to pass a strict building code inspection.
- The Result: A model that doesn't just "guess" what a document says, but reconstructs it with pixel-perfect accuracy, ensuring that every table, formula, and paragraph is exactly where it should be.
They have open-sourced their code and model, meaning anyone can now use this "architect" to turn messy scans into perfect, usable digital documents.