FireRed-OCR Technical Report

FireRed-OCR is a novel framework that transforms general-purpose VLMs into high-performance, pixel-precise document parsing experts. It pairs a "Geometry + Semantics" data factory with a three-stage progressive training strategy to overcome structural hallucinations and achieve state-of-the-art results on complex document benchmarks.

Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, Yao Hu, Boxiang Zhou, Jian Wu, Yongji Wu, Wenxin Yu, Yingmiao Liu, Yuhao Huang, Manjie Xu, Gang Liu, Yidong Ma, Zhichao Sun, Changhao Qiao

Published 2026-03-03

FireRed-OCR: Turning a "Generalist" into a "Document Surgeon"

Imagine you have a brilliant, well-read friend who can look at a picture of a painting and tell you a beautiful story about the colors and the artist's intent. This is what Large Vision-Language Models (VLMs) are like today. They are smart, creative, and great at general tasks.

But if you ask this friend to read a complex financial report, a handwritten math exam, or a newspaper with columns running in different directions, they start making things up. They might draw a table that doesn't exist, mix up the order of the paragraphs, or write a math formula that looks right but is actually nonsense. In the tech world, we call this "Structural Hallucination." It's like a chef who can cook a great steak but keeps forgetting to salt the fries.

FireRed-OCR is a new framework created by the team at Xiaohongshu (a popular Chinese social media app) to fix this. They took a general "smart" model (based on Qwen3-VL) and trained it to become a pixel-perfect document surgeon. Here is how they did it, explained simply.


1. The Problem: The "Generalist" vs. The "Specialist"

Think of a general VLM as a Jack-of-all-trades. It can do a little bit of everything, but when it comes to the strict rules of document formatting (like making sure a table has the right number of columns or a math equation is perfectly balanced), it gets sloppy.

In the real world, if a bank's software reads a check wrong because the model "hallucinated" a zero, that's a disaster. They need a specialist who follows the rules strictly.

2. The Solution: The "Geometry + Semantics" Data Factory

To train this specialist, you can't just throw random documents at it. If you feed it 1,000 simple novels and only 1 complex tax form, it will only learn how to read novels.

The team built a "Data Factory" with two special machines:

  • The Geometry Scanner: Instead of reading the words, this machine looks at the shape of the page. Is it a single column? Is it a messy table? Is it a form with boxes? It groups documents by how they look, not just what they say. This ensures the model sees plenty of weird, difficult layouts (the "long-tail" problems).
  • The Semantic Tagger: This machine labels the content (e.g., "Math," "Legal," "Handwriting").

By mixing these two, they created a perfectly balanced diet of training data. They didn't just sample randomly; they curated the data to ensure the model practiced on the hardest puzzles first.
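This two-axis curation can be sketched as a simple stratified sampler. The cell structure, field names, and per-cell quota below are illustrative assumptions, not the report's actual pipeline:

```python
import random
from collections import defaultdict

def stratified_sample(docs, per_cell=2, seed=0):
    """Draw evenly across the (geometry, semantics) grid instead of uniformly,
    so rare layouts are not drowned out by common, easy pages.
    docs: list of dicts with 'geometry' and 'semantics' keys (hypothetical schema)."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for doc in docs:
        cells[(doc["geometry"], doc["semantics"])].append(doc)
    batch = []
    for key in sorted(cells):
        pool = cells[key]
        batch.extend(rng.sample(pool, min(per_cell, len(pool))))
    return batch

# 100 easy single-column pages vs. only 3 rare financial tables:
docs = (
    [{"geometry": "single_column", "semantics": "prose", "id": i} for i in range(100)]
    + [{"geometry": "table", "semantics": "finance", "id": i} for i in range(3)]
)
batch = stratified_sample(docs)
# Each (geometry, semantics) cell contributes 2 documents, so the rare
# table/finance layout is no longer outnumbered 100-to-3.
```

Uniform sampling would give the model roughly 97% easy pages; stratifying by layout cell is what forces practice on the long tail.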

3. The Training: A Three-Step "Boot Camp"

The team didn't just dump data on the model. They used a Three-Stage Progressive Training strategy, like a martial arts master teaching a student.

Stage 1: The "Eyes and Hands" Drill (Pre-alignment)

Before the student can write an essay, they must learn to point at things.

  • What happens: The model is trained to point to specific words on a page and say what they are (Detection & OCR).
  • The Analogy: Imagine a child learning to read. First, they learn to point at a word and say "Cat." They aren't writing a story yet; they are just learning to connect the shape of the letters to the sound. This grounds the model in reality so it stops guessing where things are.
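As a rough illustration of what a "point and read" sample might look like, here is a hypothetical schema pairing each text span with its pixel box. The field names and task label are invented for illustration and are not FireRed-OCR's actual data format:

```python
def make_grounding_sample(image_path, regions):
    """Build one detection+OCR training target (hypothetical schema).
    regions: list of (text, (x0, y0, x1, y1)) pairs in pixel coordinates."""
    return {
        "image": image_path,
        "task": "detect_and_read",
        "targets": [
            {"text": text, "bbox": list(bbox)} for text, bbox in regions
        ],
    }

# The model learns to emit the text of each region together with where it is,
# grounding every word in pixel coordinates instead of guessing positions.
sample = make_grounding_sample(
    "page_001.png",
    [("Invoice No. 42", (120, 80, 360, 110)),
     ("Total: $1,250.00", (120, 900, 340, 930))],
)
```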

Stage 2: The "Strict Editor" Drill (Specialized SFT)

Now that the model can point at words, it needs to learn the rules of grammar and formatting.

  • What happens: The model is shown high-quality documents and taught to rewrite them perfectly in Markdown (a simple markup language for text).
  • The Analogy: The model is now a copy editor. It learns that if it sees a header, it must use a #. If it sees a table, it must use pipes |. It learns that a table must close properly, or the whole document breaks. It stops being creative and starts being precise.
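A minimal sketch of the kind of strict table rule this stage instills, assuming plain pipe tables. The rule set here is illustrative, not the team's actual validator:

```python
def check_markdown_table(lines):
    """Return True only if the pipe table is well-formed: a header row,
    a separator row of dashes/colons, and the same column count everywhere."""
    rows = [ln.strip() for ln in lines if ln.strip().startswith("|")]
    if len(rows) < 2:
        return False
    # Column count per row: "| a | b |" -> 2 cells.
    counts = [row.strip("|").count("|") + 1 for row in rows]
    # Second row must be a separator like "| --- | :-: |".
    is_separator = all(
        c.strip() and set(c.strip()) <= set("-:")
        for c in rows[1].strip("|").split("|")
    )
    return is_separator and len(set(counts)) == 1

good = ["| Name | Qty |", "| --- | --- |", "| Pens | 12 |"]
bad  = ["| Name | Qty |", "| --- | --- |", "| Pens | 12 | extra |"]
# good passes; bad fails because one row grew an extra column.
```

A single stray pipe breaks the whole table when rendered, which is why the "copy editor" stage treats these rules as hard constraints rather than style preferences.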

Stage 3: The "Referee" Drill (Format-Constrained GRPO)

This is the secret sauce. Even smart models sometimes cheat or get lazy.

  • What happens: They use a technique called GRPO (Group Relative Policy Optimization). Imagine the model generates 5 different versions of a document. A "Referee" (a set of strict rules) checks them:
    • Did the math formula compile? (If no, -10 points).
    • Did the table have the same number of columns in every row? (If no, -10 points).
    • Did all the brackets close? (If no, -10 points).
  • The Analogy: It's like a video game where the model gets a high score only if it follows the rules perfectly. If it tries to "hack" the system by making up a fake table, the referee catches it immediately. The model learns that structure is just as important as content.
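A toy version of the referee plus the group-relative scoring step might look like the sketch below. The -10 penalties mirror the description above, but the rule implementations, and the omission of a real formula-compilation check, are simplifying assumptions:

```python
def brackets_balanced(text, pairs=("()", "[]", "{}")):
    """Every opening bracket must be closed in the right order."""
    opens = {p[0]: p[1] for p in pairs}
    closes = {p[1] for p in pairs}
    stack = []
    for ch in text:
        if ch in opens:
            stack.append(opens[ch])
        elif ch in closes:
            if not stack or stack.pop() != ch:
                return False
    return not stack

def table_rows_consistent(text):
    """Every pipe-table row must have the same number of '|' characters."""
    rows = [ln for ln in text.splitlines() if ln.strip().startswith("|")]
    return len({ln.count("|") for ln in rows}) <= 1

def referee_score(candidate):
    score = 0.0
    if not brackets_balanced(candidate):
        score -= 10.0
    if not table_rows_consistent(candidate):
        score -= 10.0
    # A real referee would also try to compile LaTeX formulas (omitted here).
    return score

def group_relative_advantages(candidates):
    """GRPO's core idea in miniature: each candidate is rewarded relative
    to the mean of its own group, not against an absolute baseline."""
    scores = [referee_score(c) for c in candidates]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]
```

Because the advantage is computed within the group, the model only needs to beat its own siblings; a structurally broken output always sits below the group mean and gets pushed down.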

4. The Results: Beating the Giants

The team tested their new model, FireRed-OCR, on a tough benchmark called OmniDocBench.

  • The Surprise: FireRed-OCR is a relatively small model (only 2 billion parameters). Compare that to massive giants like Qwen3-VL-235B (235 billion parameters) or Gemini.
  • The Outcome: FireRed-OCR won. It scored 92.94%, beating the massive general models and even the specialized "pipeline" systems that use multiple different tools to do the job.
  • Why it matters: It proves you don't need a super-computer-sized brain to read documents perfectly. You just need the right training data and the right "boot camp" strategy.

Summary: The "FireRed" Magic

Think of FireRed-OCR as taking a talented but messy artist and turning them into a precision architect.

  1. Geometry Factory: They gave the architect a library of every possible building blueprint, not just the easy ones.
  2. Three-Stage Training: They taught the architect to measure first, then draft, then finally, to pass a strict building code inspection.
  3. The Result: A model that doesn't just "guess" what a document says, but reconstructs it with pixel-perfect accuracy, ensuring that every table, formula, and paragraph is exactly where it should be.

They have open-sourced their code and model, meaning anyone can now use this "architect" to turn messy scans into perfect, usable digital documents.