DODO: Discrete OCR Diffusion Models

The paper introduces DODO, a Vision-Language Model that employs block discrete diffusion to overcome the sequential decoding bottleneck of autoregressive OCR systems, achieving near state-of-the-art accuracy with up to 3x faster inference by addressing the structural instabilities of existing masked diffusion approaches.

Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman

Published 2026-02-20

The Big Problem: The "Slow Reader"

Imagine you have a massive, 100-page document that you need to turn into digital text (like copying a book into a computer).

For a long time, the best AI models for this job worked like a very careful, slow reader. They read the document one word at a time, from left to right.

  • To write the first word, they look at the picture.
  • To write the second word, they look at the picture and the first word they just wrote.
  • To write the third word, they look at the picture, the first word, and the second word.

This is called Autoregressive decoding. It's accurate, but it's painfully slow for long documents because the computer has to take 100 separate steps to write 100 words. It's like trying to fill a swimming pool by carrying water in a single teaspoon, one scoop at a time.
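
To make the "one step per word" cost concrete, here is a minimal sketch of an autoregressive loop. Everything in it is made up for illustration: `TARGET` and `toy_predict_next` stand in for a real page and a real model, and are not from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical toy "page": the words the model is supposed to read.
TARGET = "the quick brown fox jumps".split()


def toy_predict_next(prefix: List[str]) -> str:
    """Stand-in for a real model: returns the word that follows the prefix."""
    return TARGET[len(prefix)]


def autoregressive_decode(
    predict_next: Callable[[List[str]], str], num_words: int
) -> Tuple[List[str], int]:
    """Write one word per model call, each call seeing all earlier words."""
    words: List[str] = []
    model_calls = 0
    for _ in range(num_words):
        words.append(predict_next(words))  # depends on everything written so far
        model_calls += 1
    return words, model_calls


words, model_calls = autoregressive_decode(toy_predict_next, len(TARGET))
print(model_calls, "model calls for", len(words), "words")
```

The number of model calls always equals the number of words, which is exactly the bottleneck described above.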

The Failed Shortcut: The "Chaotic Painter"

OCR (Optical Character Recognition) is unusually rigid: the text on the page is the text, and there's no room for "maybe." In principle, that should let a model write the whole page at once instead of word by word. So researchers tried Diffusion Models (the same family of techniques behind AI image generators).

Think of a Diffusion model like a chaotic painter who starts with a canvas covered in static noise and gradually "denoises" it to reveal the text.

  • The Idea: Instead of painting one word, they try to reveal 50 words at the same time.
  • The Problem: In a creative task like writing a poem, if the painter messes up the order of words, they can just fix it later or rephrase the sentence. But in OCR, if the painter puts the word "Apple" in the wrong spot, or guesses the wrong number of words, the whole sentence breaks. It's like trying to assemble a jigsaw puzzle where you guess the shape of 50 pieces at once; if one is slightly off, the whole picture collapses.

The paper calls this the "Structural Paradox": The rigidity that makes OCR easy to predict (it's just one specific answer) also makes it impossible for standard diffusion models to fix their own mistakes.

The Solution: DODO (The "Construction Crew")

The authors introduce DODO (Discrete OCR Diffusion Models). They solved the problem by changing how the model builds the text.

Instead of trying to paint the whole page at once (too risky) or writing one word at a time (too slow), DODO works like a construction crew building a wall in sections.

  1. Block by Block: DODO breaks the document into chunks (blocks) of, say, 256 words.
  2. The Anchor: It finishes the first block completely and "locks it in." This block becomes a solid, unchangeable foundation.
  3. The Parallel Sprint: Once the first block is locked, the model looks at the second block. Because the first block is done, it can now predict all 256 words of the second block simultaneously, knowing exactly where they start.
  4. Repeat: It moves to the third block, and so on.
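
The four steps above can be sketched as a toy loop. This is an illustration under loud assumptions, not the authors' implementation: `toy_denoise_block` is a hypothetical stand-in that simply copies from a known answer, `per_step` (how many words get revealed per model call) is an invented knob, and a real model would not know the target text or its length in advance.

```python
from typing import List, Optional, Tuple

MASK = None  # marks a position that has not been filled in yet


def toy_denoise_block(
    locked: List[str], block: List[Optional[str]], target: List[str], per_step: int
) -> List[Optional[str]]:
    """Hypothetical model call: reveal up to `per_step` masked words at once.

    The locked prefix tells the model exactly where this block starts.
    """
    start = len(locked)
    out = list(block)
    revealed = 0
    for i, word in enumerate(out):
        if word is MASK and revealed < per_step:
            out[i] = target[start + i]
            revealed += 1
    return out


def block_decode(
    target: List[str], block_size: int, per_step: int
) -> Tuple[List[str], int]:
    """Fill one block at a time; finished blocks are locked and never edited."""
    if block_size < 1 or per_step < 1:
        raise ValueError("block_size and per_step must be at least 1")
    locked: List[str] = []
    model_calls = 0
    while len(locked) < len(target):
        size = min(block_size, len(target) - len(locked))
        block: List[Optional[str]] = [MASK] * size
        while any(word is MASK for word in block):
            block = toy_denoise_block(locked, block, target, per_step)
            model_calls += 1
        locked.extend(block)  # the anchor: this block is now fixed
    return locked, model_calls


page = "one two three four five six seven eight".split()
text, calls = block_decode(page, block_size=4, per_step=2)
print(calls, "model calls for", len(text), "words")
```

In this toy setup, eight words take four model calls instead of eight, because each call fills in two words of the current block while the earlier blocks stay put.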

Why This is a Game Changer

  • Speed: Because it predicts huge chunks of text at once (like a whole paragraph in a single step), it is up to 3 times faster than the old "slow reader" models.
  • Accuracy: Because it locks in each section before moving to the next, it is far less likely to get "lost" or confused about where a word should go. It keeps nearly the precision of the slow reader while approaching the speed of the chaotic painter.

The "Fast" Version: DODO-Fast

The paper also introduces a "Fast" version. Imagine the construction crew doesn't just build the wall; they also save the blueprints of the finished sections.

  • When they move to the next section, they don't need to re-calculate the physics of the wall they just built. They just grab the saved blueprint and keep going.
  • This makes the process even faster, allowing the AI to process documents at lightning speed without losing accuracy.
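
Here is a rough way to count the work that saved "blueprints" avoid. It is a sketch that assumes the blueprints behave like cached intermediate results for locked blocks; the summary above does not spell out the exact mechanism, and `count_block_encodings` and its parameters are hypothetical names used only for this illustration.

```python
def count_block_encodings(num_blocks: int, steps_per_block: int, use_cache: bool) -> int:
    """Count how many times any block has to be processed.

    Assumption: at every step the model needs the current block plus every
    locked block before it. With a cache, each locked block is processed
    once and its saved result is reused from then on.
    """
    encodings = 0
    cache = {}  # block index -> saved "blueprint" (a placeholder string here)
    for current in range(num_blocks):
        for _ in range(steps_per_block):
            for previous in range(current):
                if use_cache and previous in cache:
                    continue  # reuse the saved blueprint, no recomputation
                encodings += 1
                if use_cache:
                    cache[previous] = f"blueprint-{previous}"
            encodings += 1  # the block being written is always recomputed
    return encodings


without_cache = count_block_encodings(num_blocks=4, steps_per_block=3, use_cache=False)
with_cache = count_block_encodings(num_blocks=4, steps_per_block=3, use_cache=True)
print(without_cache, "block encodings without cache,", with_cache, "with cache")
```

With these toy numbers the cached version does half the work (15 block encodings instead of 30), and the gap widens as the document gets longer, since the uncached cost grows with the number of already finished blocks.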

The Bottom Line

DODO is like upgrading from a person typing a document one letter at a time to a team of typists who can type entire paragraphs simultaneously, but with a strict rule: they must finish one paragraph perfectly before starting the next.

This allows computers to digitize massive libraries of documents in a fraction of the time it used to take, making the "digital transformation" of our physical world much faster and cheaper.
