DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

The paper introduces DRetHTR, a decoder-only Retentive Network for handwritten text recognition. It matches state-of-the-art accuracy while significantly improving inference speed and memory efficiency: softmax-free retention eliminates the growing KV cache, and a novel layer-wise gamma scaling mechanism balances local and long-range context.

Changhun Kim, Martin Mayr, Thomas Gorges, Fei Wu, Mathias Seuret, Andreas Maier, Vincent Christlein

Published 2026-02-20

Imagine you are trying to teach a robot to read messy, handwritten letters from the 1800s. The robot needs to look at a squiggly line of ink (the image) and turn it into typed words (the text).

For a long time, the best robots used a system called a Transformer. Think of a Transformer like a super-smart librarian who, every time they read a new word, has to run back to the beginning of the book, read every single previous word again, and write down a massive summary note to remember the context.

  • The Problem: As the sentence gets longer, this librarian gets slower and slower. They have to carry a growing stack of notes (memory) that gets huge and heavy. If the sentence is long, the librarian gets overwhelmed, takes forever to finish, and runs out of desk space.

The authors of this paper, DRetHTR, built a new kind of robot that solves this problem. They call it a Retentive Network.

The New Robot: The "Smart Note-Taker"

Instead of the librarian running back to the start every time, the new robot is like a smart note-taker who keeps a single, compact mental summary.

  • How it works: When the robot reads a new word, it updates its current summary just a tiny bit. It doesn't need to re-read the whole book.
  • The Result: Whether the sentence is 5 words or 500 words, the robot takes the exact same amount of time to process each new word. It's like walking down a hallway: it takes the same effort to walk the first step as it does the hundredth step. It doesn't get tired or slow down.
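The "tiny update" above is the core of retention. A minimal sketch of one decoding step, in simplified form (made-up dimensions, a single head, and none of the projections or normalization a real Retentive Network layer would include; `retention_step` and `gamma=0.9` are illustrative, not the paper's exact values):

```python
import numpy as np

def retention_step(state, q, k, v, gamma=0.9):
    """One decoding step of simplified retention.

    state: running summary matrix, shape (d, d)
    q, k, v: query/key/value vectors for the new token, shape (d,)
    gamma: decay factor controlling how fast old context fades
    """
    # Update the compact summary: decay old content, add the new token.
    state = gamma * state + np.outer(k, v)
    # Read out with the query -- no lookback over previous tokens.
    out = q @ state
    return state, out

d = 4
rng = np.random.default_rng(0)
state = np.zeros((d, d))
for _ in range(100):  # cost per step is constant, independent of position
    q, k, v = rng.normal(size=(3, d))
    state, out = retention_step(state, q, k, v)
```

Note that the state stays a fixed (d, d) matrix no matter how many tokens have been read, which is exactly why the per-token cost does not grow with sequence length.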

The Secret Sauce: Two Different Brains

The tricky part of handwriting recognition is that the robot has to do two things at once:

  1. Look at the picture (Is that a 'b' or a 'd'?).
  2. Understand the grammar (Does "The cat" make sense, or should it be "The bat"?).

Old systems tried to do both with the same "running back to the start" method, which was slow. The DRetHTR robot uses a clever hybrid approach called ARMF (Attention-Retention Modality Fusion):

  • The "Snapshot" Brain (Images): When looking at the handwriting image, the robot uses a "snapshot" method (like the old Transformers). It looks at the whole picture at once to figure out what the letters look like. This stays cheap because the picture never changes: its features are computed once and reused at every decoding step.
  • The "Flow" Brain (Text): When reading the words it just wrote, the robot uses the "Smart Note-Taker" method. It flows forward, updating its memory one word at a time without looking back.
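The two "brains" can be sketched in one hypothetical decoding step: full cross-attention over the fixed image features, plus a recurrent retention update over the text generated so far. This is a simplified illustration of the fusion idea, not the paper's actual ARMF implementation (`armf_decode_step`, the additive fusion, and all dimensions are assumptions for clarity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def armf_decode_step(img_feats, state, q_txt, k_txt, v_txt, q_img, gamma=0.95):
    """One hypothetical ARMF-style decoding step (simplified sketch).

    img_feats: (keys, values) for the static image, shapes (n, d) each
    state: recurrent text summary, shape (d, d)
    """
    k_img, v_img = img_feats
    # "Snapshot" brain: cross-attention over the whole image at once.
    visual = softmax(q_img @ k_img.T) @ v_img
    # "Flow" brain: constant-time retention update over the text stream.
    state = gamma * state + np.outer(k_txt, v_txt)
    textual = q_txt @ state
    # Fuse the two modalities (here simply added, for illustration).
    return state, visual + textual

d, n = 4, 10
rng = np.random.default_rng(1)
k_img, v_img = rng.normal(size=(2, n, d))
state = np.zeros((d, d))
q_txt, k_txt, v_txt, q_img = rng.normal(size=(4, d))
state, fused = armf_decode_step((k_img, v_img), state, q_txt, k_txt, v_txt, q_img)
```

The key point: only the image pathway attends over a sequence, and that sequence (the image) is fixed, so nothing in this step grows as the output text gets longer.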

The Analogy: Imagine you are dictating a letter to a friend.

  • The Old Way: Every time you say a word, your friend stops, pulls out a giant notebook, reads every word you've said so far to understand the context, and then writes it down.
  • The DRetHTR Way: Your friend listens to the sound of your voice (the image) to know what letter to write, but they keep a running mental list of the sentence structure (the text) that updates instantly. They never have to stop and re-read the whole list.

The "Zoom Lens" Trick

The authors noticed a small problem: If the robot only keeps a simple summary, it might forget the beginning of the sentence by the time it gets to the end.

To fix this, they gave the robot a "Zoom Lens" (Layer-wise Gamma Scaling).

  • Shallow Layers (The Wide Angle): The early parts of the robot's brain focus on local details. They look at the immediate neighbors (e.g., "Is this 'th' or 't' followed by 'h'?").
  • Deep Layers (The Telephoto): The deeper parts of the brain zoom out to see the big picture. They remember the start of the sentence to ensure the grammar makes sense.

This mimics how human attention works: we look closely at the letters right in front of us, but we also keep the whole sentence in mind.
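One way to picture the "Zoom Lens" numerically: each layer's gamma sets how slowly its memory fades, so an increasing schedule gives shallow layers a short horizon and deep layers a long one. The schedule below is purely illustrative (the paper's exact formula and values are not reproduced here; `layerwise_gammas`, `g_min`, and `g_max` are made-up names):

```python
import numpy as np

def layerwise_gammas(num_layers, g_min=0.6, g_max=0.99):
    """Hypothetical layer-wise decay schedule: shallow layers decay
    fast (local, "wide angle"), deep layers decay slowly (long-range,
    "telephoto")."""
    return np.linspace(g_min, g_max, num_layers)

gammas = layerwise_gammas(6)
# Rough "memory horizon" of each layer ~ 1 / (1 - gamma):
# it grows with depth, so deeper layers remember further back.
horizons = 1.0 / (1.0 - gammas)
```

With gamma = 0.6 a token's contribution is nearly gone after a handful of steps, while with gamma = 0.99 it lingers for on the order of a hundred steps, matching the wide-angle/telephoto intuition above.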

Why Does This Matter?

The paper tested this new robot on famous handwriting datasets (like old diaries and French administrative mail). The results were impressive:

  • Speed: It is 1.6 to 1.9 times faster than the best existing models.
  • Memory: It uses 38–42% less computer memory.
  • Accuracy: It is just as accurate as the slow, heavy models.

In simple terms: They built a handwriting reader that is as smart as the current champions but runs on a lighter engine. It doesn't need a supercomputer to read a long letter; it can do it quickly and efficiently, making it much more practical for real-world use, like digitizing old libraries or processing insurance forms instantly.
