N-gram Injection into Transformers for Dynamic Language Model Adaptation in Handwritten Text Recognition

This paper proposes an N-gram Injection (NGI) method that dynamically adapts Transformer-based handwritten text recognition models to target language distributions at inference time. By injecting external n-gram language models, NGI significantly reduces the performance gaps caused by language shifts, without requiring any additional training on target data.

Florent Meyer, Laurent Guichard, Denis Coquenet, Guillaume Gravier, Yann Soullard, Bertrand Coüasnon

Published 2026-03-05

The Problem: The "Over-Confident" Translator

Imagine you hire a brilliant translator to read handwritten notes. You train this translator exclusively on French recipes. They become a master at reading "flour," "sugar," and "oven." They are so good at French recipes that they can read them even if the handwriting is messy.

Now, imagine you hand them a handwritten medical prescription (the "target" task). The handwriting looks similar to the recipes, but the words are totally different: "dosage," "pill," "heart."

Because the translator was so deeply trained on French recipes, their brain is biased. When they see a messy scribble that looks like a word, their brain automatically guesses, "Oh, that must be 'flour'!" even if it's actually "pill." They are so confident in their training that they fail to recognize the new context.

In the world of computers, this is called Language Shift. Modern AI (Transformers) is great at reading handwriting, but if the words it sees at test time are different from the words it learned during training, its performance crashes.

The Solution: The "Dynamic Dictionary" (NGI)

The authors of this paper propose a clever fix called N-gram Injection (NGI).

Instead of retraining the whole translator (which takes forever and requires thousands of new examples), they give the translator a dynamic dictionary right at the moment of reading.

  • The Old Way: The AI tries to guess the word based only on what it learned in the past.
  • The New Way (NGI): As the AI reads the messy handwriting, it simultaneously looks at a "cheat sheet" (an n-gram model) that contains the most likely words for this specific situation.

If the AI is reading a medical form, the cheat sheet says, "Hey, in this context, the next word is likely 'aspirin' or 'dose,' not 'cupcake'." The AI then adjusts its guess instantly.
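In decoding terms, the "cheat sheet" idea boils down to combining two scores: how much a word matches the image, and how plausible that word is according to the injected n-gram model. Here is a minimal sketch of that combination; the function, the weighting scheme, and the numbers are illustrative, not the paper's exact formulation (the paper injects the n-gram earlier in the model, as the next section explains).

```python
import math

def rescore(visual_probs, ngram_probs, weight=0.3):
    """Combine visual evidence with n-gram expectations in log space.

    visual_probs: P(word | image) from the handwriting recognizer.
    ngram_probs:  P(word | context) from the injected n-gram model.
    weight:       how much to trust the language model (illustrative value).
    """
    scores = {}
    for word, p_visual in visual_probs.items():
        p_lm = ngram_probs.get(word, 1e-6)  # tiny floor for unseen words
        scores[word] = math.log(p_visual) + weight * math.log(p_lm)
    return max(scores, key=scores.get)

# The scribble looks slightly more like "flour", but the medical n-gram
# says "pill" is far more plausible in this context.
visual = {"flour": 0.55, "pill": 0.45}
medical_ngram = {"pill": 0.6, "dosage": 0.3}
print(rescore(visual, medical_ngram))  # -> "pill"
```

Because "flour" never appears in the medical n-gram, its language score collapses, and "pill" wins despite the slightly weaker visual match.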

How It Works: The "Early Intervention"

Most people try to fix this problem after the AI makes a mistake (like a teacher correcting a student's essay at the end). This paper suggests a better approach: Early Injection.

Imagine the AI is a detective solving a mystery.

  1. Standard AI: The detective looks at the clue (the handwriting) and guesses the suspect based on their past cases.
  2. NGI AI: Before the detective even starts guessing, you hand them a file of "Current Suspects" (the n-gram data). The detective looks at the handwriting and the file simultaneously. They learn to weigh the handwriting clues against the current suspect list.

By injecting this information early into the AI's decision-making process, the AI learns to balance what it sees (the image) with what it expects (the language rules) in real-time.
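One hedged way to picture "early" injection in code: instead of correcting the output afterwards, the n-gram's predicted distribution is appended to the decoder's input features, so the model weighs it while deciding. Everything below is a toy sketch under that assumption; the names, shapes, and the single linear layer are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["flour", "sugar", "pill", "dosage"]

def decoder_step(image_features, ngram_probs, W):
    """Toy decoder step: the n-gram distribution is concatenated with the
    visual features (early injection), so it shapes the decision itself
    rather than patching the answer after the fact."""
    x = np.concatenate([image_features, ngram_probs])
    logits = W @ x
    return VOCAB[int(np.argmax(logits))]

dim = 8
W = rng.normal(size=(len(VOCAB), dim + len(VOCAB)))  # stand-in for learned weights
image_features = rng.normal(size=dim)                # stand-in for encoder output
medical_ngram = np.array([0.05, 0.05, 0.6, 0.3])     # "pill" likely in this domain
print(decoder_step(image_features, medical_ngram, W))
```

The key design point is that `W` is trained with the n-gram slot present, so the model learns how much to trust the "suspect file" versus the handwriting; swapping in a new n-gram vector at test time changes the prediction without touching `W`.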

The "N-gram" Concept

What is an n-gram? Think of it as a "predictive text" feature on your phone, but supercharged.

  • If you type "I am going to the...", your phone knows the next word is likely "store," "park," or "gym."
  • An n-gram is just a statistical map of these word combinations.
  • The Magic: You can swap these maps instantly. If you switch from reading recipes to reading medical forms, you just swap the "Recipe Map" for the "Medical Map." The AI doesn't need to be retrained; it just needs the new map.
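The "statistical map" above is literally just counts turned into probabilities. A minimal bigram (2-gram) version, with made-up two-line corpora standing in for the recipe and medical domains:

```python
from collections import defaultdict

def build_bigram_model(corpus):
    """Count word-pair frequencies and convert them to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {w: c / total for w, c in followers.items()}
    return model

# Two tiny "maps" built from different domains.
recipe_model = build_bigram_model(["add the flour", "mix the sugar"])
medical_model = build_bigram_model(["take the pill", "check the dosage"])

# Swapping domains is just swapping dictionaries -- no retraining.
print(recipe_model["the"])   # -> {'flour': 0.5, 'sugar': 0.5}
print(medical_model["the"])  # -> {'pill': 0.5, 'dosage': 0.5}
```

This is why the swap is instant: the Transformer's weights never change, only the lookup table it consults.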

The "Word Attention Network" (WAN)

The authors also built a new, lightweight AI model called WAN (Word Attention Network) to test this.

  • Think of big AI models as heavy trucks. They are powerful but slow and expensive to fuel (train).
  • WAN is a scooter. It's small, fast, and efficient.
  • The paper shows that even with this small scooter, if you give it the right "Dynamic Dictionary" (NGI), it can outperform the heavy trucks on specific tasks without needing a massive engine.

The Results: Why It Matters

The team tested this on three different handwriting datasets (like switching from reading a student's essay to reading a doctor's note).

  1. Without NGI: When the language changed, the AI's error rate doubled or tripled. It was confused and useless.
  2. With NGI: By swapping the "cheat sheet" (n-gram) to match the new text, the AI's accuracy stayed high. It didn't get confused by the shift.

The Big Win:
Usually, to fix an AI that is confused by new data, you have to feed it thousands of new examples and retrain it for days. This paper shows you can fix it instantly just by changing the language guide (the n-gram) at the moment of reading. No extra training, no extra cost, just a smarter way to look at the data.

Summary Analogy

Imagine you are playing a video game where the rules change every level.

  • Old AI: You memorize the rules for Level 1. When you get to Level 2, you keep trying to use Level 1 rules and you lose.
  • This Paper's AI: You are given a rulebook for the current level right before you start. You don't need to relearn the game; you just read the new rulebook and play perfectly.

This method allows computers to read messy handwriting from any context (legal forms, medical notes, historical letters) without needing to be retrained for every single new job.