SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

This paper introduces SimpleOCR, a plug-and-play training strategy that renders text queries directly onto images, forcing Multimodal Large Language Models (MLLMs) to overcome "modality laziness" and genuinely read visual text. The approach yields significant performance gains on out-of-distribution benchmarks with extreme data efficiency.

Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han, Yiyang Zhou, Jiaqi Liu, Ruiyi Zhang, Huaxiu Yao

Published 2026-02-27

The Problem: The "Lazy Reader" AI

Imagine you have a brilliant student (the AI) who has memorized a massive library of books. If you ask them a question like, "What is the capital of France?" they can answer instantly because they know the fact.

However, this student has a bad habit called "Modality Laziness."

If you show them a picture of a map with the word "Paris" written on it and ask, "What city is this?", they often ignore the picture entirely. Instead, they just look at the text of your question, guess the answer based on their memory, and say "Paris." They are taking a shortcut. They aren't actually looking at the image; they are just guessing based on the words you typed.

The researchers found that even though these AI models are technically capable of reading text inside images (a skill called OCR), they rarely use it when they can get away with guessing. It's like a student who knows how to read a menu but just orders the "Chef's Special" every time because it's easier than reading the options.

The Diagnosis: The "Visualized Question" Test

To prove the student was being lazy, the researchers created a tricky test called the Visualized Question (VQ).

  • Normal Test: You show a picture and type a question below it. The student can ignore the picture and just read the text.
  • Visualized Question Test: The researchers take the question text and paint it directly onto the image. Now, the only way to see the question is to look at the picture. The text channel is gone.
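The Visualized Question construction can be sketched in a few lines. This is an illustrative sketch only, using Pillow; the function name, banner layout, and styling here are assumptions, not the paper's exact implementation:

```python
from PIL import Image, ImageDraw, ImageFont

def make_visualized_question(image: Image.Image, question: str) -> Image.Image:
    """Paint the question onto a banner above the image, so the only way
    to see the question is to look at the picture (hypothetical layout)."""
    banner_height = 40
    # New canvas: original image plus a white banner strip on top.
    canvas = Image.new("RGB", (image.width, image.height + banner_height), "white")
    canvas.paste(image, (0, banner_height))
    # Draw the question text into the banner.
    draw = ImageDraw.Draw(canvas)
    draw.text((5, 10), question, fill="black", font=ImageFont.load_default())
    return canvas

# Usage: the text channel is left empty -- the model must read the image.
photo = Image.new("RGB", (320, 240), "gray")  # stand-in for a real image
vq_image = make_visualized_question(photo, "What city is this?")
text_prompt = ""  # the question no longer appears as typed text
```

The key design point is the last line: once the question lives only in pixels, a model that skips the image has nothing left to shortcut from.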

The Result: When the AI was forced to read the question inside the image, its performance dropped significantly (by up to 12.7%). This proved that the AI wasn't actually using its "reading" muscles; it was just relying on shortcuts.

The Solution: "SimpleOCR" (The Training Camp)

The researchers didn't want to rebuild the AI's brain (which would be expensive and slow). Instead, they invented a simple training strategy called SimpleOCR.

Think of SimpleOCR as a training camp with a strict rule:

"You are not allowed to read the question from a text box. You must read it off the wall."

Here is how it works:

  1. The Transformation: Before training, the pipeline takes every single practice question and "renders" (paints) the text directly onto the image, just like in the diagnostic test.
  2. Random Styles: To make sure the AI doesn't just memorize "blue text on a white background," they randomize the fonts, colors, and sizes. It's like changing the font on a sign every time you walk by, forcing you to actually read the letters rather than recognizing the shape of the sign.
  3. The Constraint: The AI is only trained on these "painted" images. It has no choice but to activate its visual reading pathways to understand the question.
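Step 2 above amounts to sampling a fresh rendering configuration for every training example. A minimal sketch of such a sampler, using only the standard library; the font names, color pool, and size ranges below are hypothetical placeholders, since the summary does not specify the paper's actual values:

```python
import random

# Hypothetical style pools -- the actual fonts, colors, and size
# ranges used in the paper are not given in this summary.
FONTS = ["Arial", "Courier", "Times"]
COLORS = ["black", "navy", "darkred", "darkgreen"]

def sample_text_style(rng: random.Random) -> dict:
    """Draw a random rendering style so the model cannot memorize one
    fixed look and must actually read the letters."""
    return {
        "font": rng.choice(FONTS),
        "color": rng.choice(COLORS),
        "size": rng.randint(14, 28),  # font size in points
        "offset": (rng.randint(0, 20), rng.randint(0, 20)),  # jittered position
    }

rng = random.Random(0)  # seeded only for reproducibility of this demo
style = sample_text_style(rng)
```

Each training image would then be rendered with its own freshly sampled `style`, so no two "signs" look quite alike.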

The Magic: Why It Works So Well

The most surprising part is what happens after the training camp is over.

When the researchers stop painting the questions on the images and return to the normal format (image + text question), the AI performs better than it did before the training.

  • The Analogy: Imagine a weightlifter who trains by lifting heavy rocks (the hard VQ format). When they go back to lifting normal dumbbells (the standard format), the dumbbells feel incredibly light, and they lift them with perfect form.
  • The Result: By forcing the AI to do the hard work of reading text in images during training, it learned to be a "visual thinker" rather than a "text guesser." This made it much better at solving math problems, reading charts, and understanding diagrams, even when the text was presented normally.

Key Takeaways

  1. It's Plug-and-Play: You don't need to change the AI's architecture or add new hardware. You just change the data you feed it (painting the text on the image).
  2. Super Efficient: It achieved these results using one-thirtieth of the data required by other advanced methods. It's like getting a PhD in a fraction of the study time because the study method was so effective.
  3. No "Cheat Codes": It stops the AI from cheating by guessing based on text prompts. It forces the AI to actually look at the picture.

In short: The paper teaches AI models to stop being lazy guessers and start being careful readers by forcing them to read questions written on pictures during training. Once they learn that skill, they become smarter at everything else, too.
