Imagine you have a brilliant student who has aced every single test in a perfectly sterile, climate-controlled laboratory. They can read a textbook, solve a math problem, and organize a spreadsheet with 100% accuracy. You are so impressed that you hire them to work in a real-world office.
But then, things go wrong. The student is asked to read a page from a book that is curved at the spine, a receipt crumpled in a pocket, a photo of a screen taken under a flickering lamp, or a document held at a weird angle. Suddenly, the student starts making mistakes. They can't read the text because the lines are wavy, or they get confused by the shadows.
This is exactly the problem the Real5-OmniDocBench paper sets out to solve.
The Problem: The "Perfect Lab" vs. The "Messy Real World"
For a long time, AI systems (specifically Vision-Language Models, or VLMs) have been tested on "born-digital" documents. These are images that come straight from a computer file: perfectly flat, perfectly lit, and perfectly aligned. It's like testing a car only on a smooth, empty racetrack; of course the car looks amazing.
But in the real world, documents are messy. They get bent, folded, photographed in the dark, or scanned with a shaky hand. The paper argues that just because an AI is great at reading a perfect PDF doesn't mean it can read a crumpled receipt from a coffee shop.
The Solution: The "Physical Reconstruction"
The researchers from Baidu and HKUST decided to build a new testing ground. They took a well-known benchmark of pristine digital documents, OmniDocBench, containing 1,355 pages, and did something very physical:
They printed them out: They took every single digital page and printed it on high-quality paper.
They messed them up on purpose: They created five different "disaster scenarios" for every single page:
- Scanning: Like putting a page in a scanner but doing it slightly crooked or with a staple in the corner.
- Warping: Like bending the paper into a cylinder, crumpling it, or curling the edges like a dog-eared book page.
- Screen-Photography: Taking a photo of a document displayed on a computer or phone screen (which creates weird patterns called "moiré").
- Illumination: Taking photos in the dark, with a flashlight shining too bright, or with weird colored shadows.
- Skew: Holding the camera at a crazy angle, making the document look like a trapezoid.
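To get a feel for what these degradations do to an image, here is a minimal sketch in NumPy. Note the hedge: the benchmark itself uses real prints, cameras, and lighting, not synthetic transforms; the function names and the toy 8x8 "page" below are purely illustrative stand-ins for two of the five scenarios (illumination and skew).

```python
import numpy as np

def add_illumination_gradient(img: np.ndarray) -> np.ndarray:
    """Darken one side of the page, as if lit by an off-center lamp."""
    h, w = img.shape
    gradient = np.linspace(1.0, 0.4, w)  # bright on the left, dim on the right
    return (img * gradient).clip(0, 255).astype(img.dtype)

def skew(img: np.ndarray, shear: float = 0.2) -> np.ndarray:
    """Shear each row horizontally: a crude stand-in for an off-angle photo."""
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        shift = int(y * shear)
        if shift < w:
            out[y, shift:] = img[y, : w - shift]
    return out

# A tiny fake "document": a white page (255) with one dark text line (0).
page = np.full((8, 8), 255, dtype=np.uint8)
page[4, 1:7] = 0

dim = add_illumination_gradient(page)     # illumination scenario
tilted = skew(page, shear=0.5)            # skew scenario
```

Even these toy transforms show why models struggle: the pixel values a model was trained on (crisp black text on uniform white) no longer match what the camera actually captured.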
They kept the "Answer Key": This is the magic part. Because they started with the perfect digital version, they still knew exactly what the text should say. This allows them to compare the AI's answer on the "messy photo" against the "perfect original" to see exactly how and why it failed.
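Keeping the answer key makes scoring a direct string comparison. One common way to do this (the paper's exact metrics may differ; this is just the general idea) is Character Error Rate (CER): the edit distance between the model's transcription of the messy photo and the original text, divided by the original's length.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(prediction: str, ground_truth: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(prediction, ground_truth) / max(len(ground_truth), 1)

# The "perfect original" vs. what a model might read off a crumpled photo:
truth = "Total: $4.50"
messy = "Tota1: $4.S0"  # two misread characters
print(cer(messy, truth))  # → 0.16666666666666666
```

A CER of 0 means a perfect transcription; the benchmark can report how much this number rises for each of the five disaster scenarios, pinpointing exactly which kind of mess breaks each model.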
The Big Discovery: Small is Sometimes Better
The researchers tested 15 different AI models on this new, tough benchmark. They found something surprising that challenges the idea that "bigger is always better."
- The Giants: Huge, general-purpose AI models (like the ones with hundreds of billions of parameters) did okay, but they often stumbled when the paper was bent or the lighting was bad. They are like a genius who knows everything but gets confused by a messy room.
- The Specialists: A smaller, specialized model called PaddleOCR-VL-1.5 (which is tiny compared to the giants) actually won. It handled the crumpled paper, the weird angles, and the bad lighting better than the massive models.
The Analogy: Think of the giant models as a Swiss Army Knife with a thousand tools. It's impressive, but if you need to cut a specific type of rope, it might be clumsy. The small model is like a high-quality, specialized pair of scissors. It does one thing (reading messy documents) incredibly well because it was trained specifically for that job.
Why This Matters
This paper is like a "stress test" for the future of AI. It tells us that:
- Current AI is fragile: If you build a system that only works on perfect digital files, it will fail in the real world.
- We need better training: To make AI truly useful, we need to train it on messy, physical data, not just perfect digital files.
- Size isn't everything: You don't need a supercomputer to solve real-world problems; sometimes, a smart, specialized tool works better.
In short, Real5-OmniDocBench is a mirror held up to the AI community, showing them that while their models are perfect in the lab, they still have a lot of work to do before they can handle the messy, unpredictable reality of our physical world.