DODO: Discrete OCR Diffusion Models

The Big Problem: The "Slow Reader"

Imagine you have a massive, 100-page document that you need to turn into digital text (like copying a book into a computer).

For a long time, the best AI models for this job worked like a very careful, slow reader. They read the document one word at a time, from left to right.

To write the first word, they look at the picture.
To write the second word, they look at the picture and the first word they just wrote.
To write the third word, they look at the picture, the first word, and the second word.

This is called Autoregressive decoding. It's accurate, but it's painfully slow for long documents because the computer has to take 100 separate steps to write 100 words. It's like trying to fill a swimming pool by carrying water in a single teaspoon, one scoop at a time.

The Failed Shortcut: The "Chaotic Painter"

Researchers knew that because OCR (Optical Character Recognition) is so precise (the text is the text; there's no room for "maybe"), they should be able to write the whole page at once. They tried using Diffusion Models (the same tech behind AI image generators).

Think of a diffusion model as a chaotic painter who starts with a canvas covered in static noise and gradually "denoises" it to reveal the text.

The Idea: Instead of painting one word, they try to reveal 50 words at the same time.
The Problem: In a creative task like writing a poem, if the painter messes up the order of words, they can just fix it later or rephrase the sentence. But in OCR, if the painter puts the word "Apple" in the wrong spot, or guesses the wrong number of words, the whole sentence breaks. It's like trying to assemble a jigsaw puzzle where you guess the shape of 50 pieces at once; if one is slightly off, the whole picture collapses.

The paper calls this the "Structural Paradox": The rigidity that makes OCR easy to predict (it's just one specific answer) also makes it impossible for standard diffusion models to fix their own mistakes.

The Solution: DODO (The "Construction Crew")

The authors introduce DODO (Discrete OCR Diffusion Models). They solved the problem by changing how the model builds the text.

Instead of trying to paint the whole page at once (too risky) or writing one word at a time (too slow), DODO works like a construction crew building a wall in sections.

Block by Block: DODO breaks the document into chunks (blocks) of, say, 256 words.
The Anchor: It finishes the first block completely and "locks it in." This block becomes a solid, unchangeable foundation.
The Parallel Sprint: Once the first block is locked, the model looks at the second block. Because the first block is done, it can now predict all 256 words of the second block simultaneously, knowing exactly where they start.
Repeat: It moves to the third block, and so on.

Why This is a Game Changer

Speed: Because it predicts large chunks of text in parallel rather than one word at a time, its fastest variant (DODO Fast) is about 3 times faster than the old "slow reader" models.
Accuracy: Because it locks in each section before moving to the next, it never gets "lost" or confused about where a word should go. It keeps the precision of the slow reader with the speed of the chaotic painter.

The "Fast" Version: DODO Fast

The paper also introduces a "Fast" version. Imagine the construction crew doesn't just build the wall; they also save the blueprints of the finished sections.

When they move to the next section, they don't need to re-calculate the physics of the wall they just built. They just grab the saved blueprint and keep going.
This makes the process even faster, allowing the AI to process documents at lightning speed without losing accuracy.

The Bottom Line

DODO is like upgrading from a person typing a document one letter at a time to a team of typists who can type entire paragraphs simultaneously, but with a strict rule: they must finish one paragraph perfectly before starting the next.

This allows computers to digitize massive libraries of documents in a fraction of the time it used to take, making the "digital transformation" of our physical world much faster and cheaper.

1. Problem Statement

Optical Character Recognition (OCR) is a critical task for digitizing documents, yet current state-of-the-art solutions rely heavily on Autoregressive (AR) Vision-Language Models (VLMs). AR models generate text token-by-token in a strict left-to-right sequence. While accurate, this approach creates a significant computational bottleneck: decoding a long document requires $L$ sequential forward passes for $L$ tokens, leading to high latency and slow inference speeds.
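The sequential cost described above can be made concrete with a toy decoding loop. This is only an illustrative sketch: `toy_next_token` and `PAGE` are hypothetical stand-ins for a real VLM and a real page, and the only point is that the number of forward passes equals the number of tokens.

```python
from typing import Callable, List, Tuple

def autoregressive_decode(
    next_token: Callable[[List[str]], str], length: int
) -> Tuple[List[str], int]:
    """Generate `length` tokens one at a time; return (tokens, forward_passes)."""
    tokens: List[str] = []
    forward_passes = 0
    for _ in range(length):
        # Each new token needs its own pass over the image plus the prefix so far.
        tokens.append(next_token(tokens))
        forward_passes += 1
    return tokens, forward_passes

# Hypothetical stand-in for a VLM: it "reads" the next word of a known page.
PAGE = "the quick brown fox jumps over the lazy dog".split()

def toy_next_token(prefix: List[str]) -> str:
    return PAGE[len(prefix)]

tokens, passes = autoregressive_decode(toy_next_token, len(PAGE))
print(passes)  # one forward pass per generated token
```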

The authors propose using Masked Diffusion Models (MDMs), which can generate tokens in parallel, to overcome this latency. However, they identify a fundamental structural incompatibility:

The Rigidity of OCR: Unlike open-ended tasks (e.g., image captioning) where multiple valid outputs exist, OCR is a deterministic, rigid task with a single correct ground-truth sequence.
Failure of Global Diffusion: Standard MDMs operate on a fixed-length canvas and assume conditional independence between tokens. In flexible tasks, errors in length estimation or token placement can be corrected by paraphrasing. In OCR, these "structural errors" (e.g., predicting the wrong sequence length or misaligning a token's position) are irrecoverable. Once a token is revealed in a global diffusion step, it cannot be revised, leading to fractured, hallucinated, or truncated outputs.

2. Methodology: DODO

To resolve the tension between the efficiency of parallel diffusion and the rigidity of OCR, the authors introduce DODO (Discrete OCR Diffusion Models). The core innovation is the adaptation of Block Discrete Diffusion to the multimodal VLM setting.

Key Architectural Components:

Block Decomposition: Instead of generating the entire document sequence globally, DODO decomposes the generation task into a sequence of causally anchored blocks. The model generates a block of tokens (e.g., 256 tokens) conditioned on the previously committed blocks.
Structural Safety Rails:
- Local Alignment: By restricting the diffusion horizon to a block, the model avoids the "synchronization drift" where tokens are placed at incorrect absolute positions relative to the whole document.
- Dynamic Length Adaptation: The model can determine the length of a block dynamically without needing a perfect global estimate of the total document length upfront.
Training and Inference Variants:
- DODO (Bidirectional): Uses full bidirectional attention. Prefix representations are recomputed at every step to adapt to the current block. This maximizes information flow but prevents KV-caching.
- DODO Fast (Block-Causal): Enforces a strict causal mask where previous blocks cannot attend to the current active block. This ensures that the representations of committed prefixes remain static, enabling the use of exact Key-Value (KV) caching. This drastically reduces computational overhead during inference.
Large Block Sizes: Leveraging the high-confidence nature of OCR, the authors scale the block size up to 256 tokens (significantly larger than the 4–32 tokens used in text-only diffusion), maximizing parallel efficiency.
Sampling Strategy: The model uses Confidence Thresholding (unmasking tokens only when prediction probability exceeds a high threshold, e.g., $p=0.98$) to ensure high fidelity, avoiding the errors associated with aggressive top-$K$ sampling.
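The components above (block decomposition, committed prefixes, confidence thresholding) can be sketched as a small decoding loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_predict` is a hypothetical scorer with hand-picked confidences, the total length is assumed known (so dynamic length adaptation is not modeled), and the fallback that reveals the single most confident position when nothing clears the threshold is added here only to guarantee progress.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a not-yet-revealed token
Guess = Tuple[str, float]  # (predicted token, confidence)

def decode_blocks(
    predict: Callable[[List[str], List[Optional[str]]], List[Guess]],
    total_len: int,
    block_size: int,
    threshold: float = 0.98,
) -> Tuple[List[str], int]:
    """Block-wise parallel decoding with confidence thresholding (toy sketch)."""
    committed: List[str] = []  # finished blocks, never revised
    passes = 0
    while len(committed) < total_len:
        size = min(block_size, total_len - len(committed))
        block: List[Optional[str]] = [MASK] * size
        while MASK in block:
            guesses = predict(committed, block)  # one pass scores the whole block
            passes += 1
            masked = [i for i, tok in enumerate(block) if tok is MASK]
            confident = [i for i in masked if guesses[i][1] > threshold]
            if not confident:
                # Assumption: always reveal at least the most confident position.
                confident = [max(masked, key=lambda i: guesses[i][1])]
            for i in confident:
                block[i] = guesses[i][0]
        committed.extend(block)  # lock the block in and move on
    return committed, passes

# Hypothetical scorer: even positions are "easy", odd positions become
# confident only once their left neighbour inside the block is revealed.
TRUTH = "a b c d e f g h i j".split()

def toy_predict(committed: List[str], block: List[Optional[str]]) -> List[Guess]:
    start = len(committed)
    out = []
    for i in range(len(block)):
        left_known = i > 0 and block[i - 1] is not MASK
        conf = 0.99 if (i % 2 == 0 or left_known) else 0.90
        out.append((TRUTH[start + i], conf))
    return out

text, passes = decode_blocks(toy_predict, len(TRUTH), block_size=4)
print(text == TRUTH, passes)  # 10 tokens recovered in 6 passes
```

With this toy scorer every block finishes in two passes, so the 10 tokens take 6 passes instead of the 10 a token-by-token loop would need. Real confidences come from the model and will not follow such a regular pattern.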

3. Key Contributions

Identification of Structural Incompatibility: The paper theoretically and empirically demonstrates why standard global masked diffusion fails for OCR due to the inability to recover from length and positional errors in rigid, unimodal tasks.
First Block-Discrete Diffusion VLM: DODO is the first Vision-Language Model to successfully apply block discrete diffusion for document transcription, bridging the gap between AR stability and diffusion speed.
KV-Caching for Diffusion: The introduction of "DODO Fast" demonstrates that block-causal training enables exact KV-caching in diffusion models, a technique previously thought difficult to apply to non-autoregressive generation without accuracy collapse.
State-of-the-Art Efficiency: The work shows that parallel decoding is viable for dense text recognition, with DODO Fast reaching roughly 63 tokens/sec, a speed previously unattainable for diffusion-based OCR.
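The block-causal masking behind the DODO Fast contribution can be illustrated with a small mask builder. This is a sketch under assumptions: it covers text positions only (image tokens are left out), and it assumes attention is bidirectional inside a block, which the summary does not state explicitly. The function name `block_causal_mask` is hypothetical.

```python
from typing import List

def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    block_of = [i // block_size for i in range(seq_len)]
    # A position sees its own block and every earlier block, never a later one.
    return [
        [block_of[k] <= block_of[q] for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Because earlier blocks never attend to later ones, their keys and values do not change once a block is committed, which is what makes an exact KV cache possible.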

4. Experimental Results

The authors evaluated DODO on OmniDocBench (complex layouts, tables, formulas) and Fox-Page-EN (dense text).

Accuracy:
- DODO achieves a Normalized Edit Distance (NED) of 0.066 on OmniDocBench, significantly outperforming other diffusion-based VLMs (e.g., Dimple: 0.856, LLaDA-V: 0.524).
- It matches or exceeds the performance of specialized OCR engines (e.g., MonkeyOCR, Mistral OCR) and autoregressive baselines (Qwen2.5-VL family).
Throughput (Speed):
- DODO (Standard): Achieves ~23 tokens/sec, already faster than the autoregressive Qwen2.5-VL-3B baseline (~21 tokens/sec) despite recomputing the full sequence, due to fewer forward passes.
- DODO Fast: Leverages KV-caching to achieve ~63 tokens/sec, roughly a 3x speedup over the autoregressive baseline (~21 tokens/sec) and about 2.7x over the standard DODO variant (~23 tokens/sec).
Ablation Studies:
- Vanilla global diffusion (even with oracle length guidance) fails to converge on dense text, confirming that block training is a necessary structural prior.
- Smaller block sizes in the "Fast" variant yield better accuracy, suggesting that frequent re-committing of tokens to the static cache helps maintain alignment.
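For readers unfamiliar with the accuracy metric quoted above, Normalized Edit Distance can be sketched as a Levenshtein distance divided by a length. Dividing by the length of the longer string is an assumption made here for illustration; the benchmarks may normalize differently. Lower is better, and 0 means a perfect transcription.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed convention)."""
    longest = max(len(pred), len(truth))
    return 0.0 if longest == 0 else edit_distance(pred, truth) / longest

# One wrong character in a 15-character line.
print(normalized_edit_distance("D0DO reads fast", "DODO reads fast"))
```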

5. Significance

DODO represents a paradigm shift in document understanding:

Breaking the Latency Wall: It demonstrates that OCR does not strictly require autoregressive decoding. By utilizing the deterministic nature of OCR, parallel diffusion can be made robust and accurate.
Practical Deployment: The "DODO Fast" variant offers a practical solution for latency-critical applications (e.g., real-time document scanning), proving that diffusion models can be both accurate and fast.
Generalization to VLMs: The success of block-causal masking and KV-caching in a multimodal setting opens new avenues for accelerating other dense generation tasks in Vision-Language Models beyond just OCR.

In summary, DODO successfully reconciles the stability of causal generation with the efficiency of parallel diffusion, establishing a new standard for high-throughput, high-accuracy document OCR.
