DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

The paper introduces DRetHTR, a decoder-only Retentive Network for handwritten text recognition. It matches state-of-the-art accuracy while significantly improving inference speed and memory efficiency: softmax-free retention eliminates the growing KV cache, and a novel layer-wise gamma scaling mechanism balances local and long-range context.

Changhun Kim, Martin Mayr, Thomas Gorges, Fei Wu, Mathias Seuret, Andreas Maier, Vincent Christlein

Published 2026-02-20

Imagine you are trying to teach a robot to read messy, handwritten letters from the 1800s. The robot needs to look at a squiggly line of ink (the image) and turn it into typed words (the text).

For a long time, the best robots used a system called a Transformer. Think of a Transformer like a super-smart librarian who, every time they read a new word, has to run back to the beginning of the book, read every single previous word again, and write down a massive summary note to remember the context.

  • The Problem: As the sentence gets longer, this librarian gets slower and slower. They have to carry a growing stack of notes (memory) that gets huge and heavy. If the sentence is long, the librarian gets overwhelmed, takes forever to finish, and runs out of desk space.

The authors of this paper, DRetHTR, built a new kind of robot that solves this problem. They call it a Retentive Network.

The New Robot: The "Smart Note-Taker"

Instead of the librarian running back to the start every time, the new robot is like a smart note-taker who keeps a single, compact mental summary.

  • How it works: When the robot reads a new word, it updates its current summary just a tiny bit. It doesn't need to re-read the whole book.
  • The Result: Whether the sentence is 5 words or 500 words, the robot takes the exact same amount of time to process each new word. It's like walking down a hallway: it takes the same effort to walk the first step as it does the hundredth step. It doesn't get tired or slow down.
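The "tiny update" above is the core of retention. A minimal sketch of one decoding step, in simplified form (made-up dimensions, a single head, and none of the projections or normalization a real Retentive Network layer would include; `retention_step` and `gamma=0.9` are illustrative, not the paper's exact values):

```python
import numpy as np

def retention_step(state, q, k, v, gamma=0.9):
    """One decoding step of simplified retention.

    state: running summary matrix, shape (d, d)
    q, k, v: query/key/value vectors for the new token, shape (d,)
    gamma: decay factor controlling how fast old context fades
    """
    # Update the compact summary: decay old content, add the new token.
    state = gamma * state + np.outer(k, v)
    # Read out with the query -- no lookback over previous tokens.
    out = q @ state
    return state, out

d = 4
rng = np.random.default_rng(0)
state = np.zeros((d, d))
for _ in range(100):  # cost per step is constant, independent of position
    q, k, v = rng.normal(size=(3, d))
    state, out = retention_step(state, q, k, v)
```

Note that the state stays a fixed (d, d) matrix no matter how many tokens have been read, which is exactly why the per-token cost does not grow with sequence length.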

The Secret Sauce: Two Different Brains

The tricky part of handwriting recognition is that the robot has to do two things at once:

  1. Look at the picture (Is that a 'b' or a 'd'?).
  2. Understand the grammar (Does "The cat" make sense, or should it be "The bat"?).

Old systems tried to do both with the same "running back to the start" method, which was slow. The DRetHTR robot uses a clever hybrid approach called ARMF (Attention-Retention Modality Fusion):

  • The "Snapshot" Brain (Images): When looking at the handwriting image, the robot uses a "snapshot" method (like the old Transformers). It looks at the whole picture at once to figure out what the letters look like. This stays cheap because the picture never changes: its features are computed once and reused at every decoding step.
  • The "Flow" Brain (Text): When reading the words it just wrote, the robot uses the "Smart Note-Taker" method. It flows forward, updating its memory one word at a time without looking back.
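The two "brains" can be sketched in one hypothetical decoding step: full cross-attention over the fixed image features, plus a recurrent retention update over the text generated so far. This is a simplified illustration of the fusion idea, not the paper's actual ARMF implementation (`armf_decode_step`, the additive fusion, and all dimensions are assumptions for clarity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def armf_decode_step(img_feats, state, q_txt, k_txt, v_txt, q_img, gamma=0.95):
    """One hypothetical ARMF-style decoding step (simplified sketch).

    img_feats: (keys, values) for the static image, shapes (n, d) each
    state: recurrent text summary, shape (d, d)
    """
    k_img, v_img = img_feats
    # "Snapshot" brain: cross-attention over the whole image at once.
    visual = softmax(q_img @ k_img.T) @ v_img
    # "Flow" brain: constant-time retention update over the text stream.
    state = gamma * state + np.outer(k_txt, v_txt)
    textual = q_txt @ state
    # Fuse the two modalities (here simply added, for illustration).
    return state, visual + textual

d, n = 4, 10
rng = np.random.default_rng(1)
k_img, v_img = rng.normal(size=(2, n, d))
state = np.zeros((d, d))
q_txt, k_txt, v_txt, q_img = rng.normal(size=(4, d))
state, fused = armf_decode_step((k_img, v_img), state, q_txt, k_txt, v_txt, q_img)
```

The key point: only the image pathway attends over a sequence, and that sequence (the image) is fixed, so nothing in this step grows as the output text gets longer.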

The Analogy: Imagine you are dictating a letter to a friend.

  • The Old Way: Every time you say a word, your friend stops, pulls out a giant notebook, reads every word you've said so far to understand the context, and then writes it down.
  • The DRetHTR Way: Your friend listens to the sound of your voice (the image) to know what letter to write, but they keep a running mental list of the sentence structure (the text) that updates instantly. They never have to stop and re-read the whole list.

The "Zoom Lens" Trick

The authors noticed a small problem: If the robot only keeps a simple summary, it might forget the beginning of the sentence by the time it gets to the end.

To fix this, they gave the robot a "Zoom Lens" (Layer-wise Gamma Scaling).

  • Shallow Layers (The Wide Angle): The early parts of the robot's brain focus on local details. They look at the immediate neighbors (e.g., "Is this 'th' or 't' followed by 'h'?").
  • Deep Layers (The Telephoto): The deeper parts of the brain zoom out to see the big picture. They remember the start of the sentence to ensure the grammar makes sense.

This mimics how human attention works: we look closely at the letters right in front of us, but we also keep the whole sentence in mind.
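One way to picture the "Zoom Lens" numerically: each layer's gamma sets how slowly its memory fades, so an increasing schedule gives shallow layers a short horizon and deep layers a long one. The schedule below is purely illustrative (the paper's exact formula and values are not reproduced here; `layerwise_gammas`, `g_min`, and `g_max` are made-up names):

```python
import numpy as np

def layerwise_gammas(num_layers, g_min=0.6, g_max=0.99):
    """Hypothetical layer-wise decay schedule: shallow layers decay
    fast (local, "wide angle"), deep layers decay slowly (long-range,
    "telephoto")."""
    return np.linspace(g_min, g_max, num_layers)

gammas = layerwise_gammas(6)
# Rough "memory horizon" of each layer ~ 1 / (1 - gamma):
# it grows with depth, so deeper layers remember further back.
horizons = 1.0 / (1.0 - gammas)
```

With gamma = 0.6 a token's contribution is nearly gone after a handful of steps, while with gamma = 0.99 it lingers for on the order of a hundred steps, matching the wide-angle/telephoto intuition above.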

Why Does This Matter?

The paper tested this new robot on famous handwriting datasets (like old diaries and French administrative mail). The results were impressive:

  • Speed: It is 1.6 to 1.9 times faster than the best existing models.
  • Memory: It uses 38–42% less computer memory.
  • Accuracy: It is just as accurate as the slow, heavy models.

In simple terms: They built a handwriting reader that is as smart as the current champions but runs on a lighter engine. It doesn't need a supercomputer to read a long letter; it can do it quickly and efficiently, making it much more practical for real-world use, like digitizing old libraries or processing insurance forms instantly.
