ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

The paper introduces ScribeTokens, a fixed-vocabulary tokenization method for digital ink that decomposes pen movements into unit pixel steps. It outperforms vector representations in both handwritten text generation and recognition, especially when combined with a novel next-ink-token prediction pretraining strategy.

Douglass Wang

Published 2026-03-04

Imagine you are trying to teach a robot to understand your handwriting. You have a digital pen, and every time you move it, the computer records a stream of coordinates (x, y points). The problem is: how do you translate this messy stream of numbers into something a smart AI can easily read and write?

This paper introduces a new solution called ScribeTokens. Here is the breakdown using simple analogies.

The Problem: Three Ways to Describe a Drawing

Before ScribeTokens, researchers tried three main ways to describe handwriting to a computer, and each had a major flaw:

  1. The "Continuous Stream" (Vectors):

    • The Analogy: Imagine describing a drawing by saying, "Move 0.004 inches right, then 0.003 inches up, then 0.001 inches left..."
    • The Flaw: This creates a massive, endless list of numbers. It's like trying to describe a movie by listing every single frame. It's slow, hard to train, and the AI gets confused easily.
  2. The "Pixel Map" (Existing Tokens):

    • The Analogy: Imagine breaking the drawing into a grid and saying, "Go to square (10, 5), then square (10, 6)..."
    • The Flaw: This is better, but if the AI sees a square it hasn't seen before (like a weirdly placed dot), it panics. It's like a dictionary that breaks if you use a word that isn't in it. Also, the dictionary needs to be huge to cover every possible square on a page.
  3. The "Number String" (Text Tokens):

    • The Analogy: Turning the coordinates into text like "1, 0, 2, minus 5..."
    • The Flaw: The AI might get the numbers right but put them in the wrong order, creating a "word salad" that doesn't make sense as a drawing.
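To make the three approaches concrete, here is how one short stroke might look under each representation. This is a hypothetical illustration of the general idea, not the exact formats used by prior work:

```python
# The same short stroke under the three prior representations (illustrative).
stroke = [(10, 5), (11, 5), (11, 6)]  # absolute integer pen positions

# 1. "Continuous Stream": real-valued offsets between consecutive samples.
vectors = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(stroke, stroke[1:])]
print(vectors)        # [(1, 0), (0, 1)]

# 2. "Pixel Map": one token per absolute grid cell. The vocabulary must
#    cover every possible (x, y) cell, and unseen cells are out-of-vocabulary.
grid_tokens = [f"CELL_{x}_{y}" for (x, y) in stroke]
print(grid_tokens)    # ['CELL_10_5', 'CELL_11_5', 'CELL_11_6']

# 3. "Number String": coordinates spelled out as text for a language model,
#    which can scramble digit order and produce invalid drawings.
number_string = ", ".join(f"{x} {y}" for (x, y) in stroke)
print(number_string)  # '10 5, 11 5, 11 6'
```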

The Solution: ScribeTokens (The "Step-by-Step" Approach)

The authors realized that instead of describing where the pen is, we should describe how it moves.

The Core Idea:
Imagine you are walking through a city grid. Instead of giving someone your GPS coordinates, you just give them a list of simple directions: "Step Right, Step Right, Step Up, Step Right."

ScribeTokens does exactly this:

  1. The 10-Word Vocabulary: It breaks every pen stroke down into tiny, unit steps. There are only 10 possible instructions:

    • 8 directions (Up, Down, Left, Right, and the 4 diagonals).
    • 2 states (Pen Down / Pen Up).
    • Analogy: It's like a robot that only knows 10 commands. No matter how complex the drawing is, it's just a long sentence made of these 10 words.
  2. No "Unknown" Words: Because every movement is just a combination of these 10 steps, the AI will never encounter a word it doesn't know. It's impossible to draw something the AI can't describe.

  3. Compression (The "Zip File"): Even with just 10 words, a long sentence is still long. So, they use a trick called BPE (Byte-Pair Encoding).

    • Analogy: If the AI sees "Step Right, Step Right, Step Right" often, it creates a new shortcut symbol for "Triple Step Right." This shrinks the data size significantly, making the AI faster and smarter.
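The two ideas above, a 10-token vocabulary of unit steps plus BPE compression, can be sketched in a few lines of Python. This is an illustrative approximation of the approach, not the paper's implementation; the token names and the staircase step decomposition are assumptions:

```python
from collections import Counter

# The 10-token base vocabulary: 8 unit-step directions plus 2 pen states.
# (Token names are hypothetical; y is assumed to increase upward.)
DIRECTIONS = {
    (1, 0): "R", (-1, 0): "L", (0, 1): "U", (0, -1): "D",
    (1, 1): "UR", (-1, 1): "UL", (1, -1): "DR", (-1, -1): "DL",
}
PEN_DOWN, PEN_UP = "PEN_DOWN", "PEN_UP"

def tokenize_stroke(points, pen_down=True):
    """Convert integer (x, y) points into unit-step tokens.

    Each consecutive pair of points is decomposed into unit steps along
    the 8 directions (a simple staircase approximation), so every possible
    movement maps onto the fixed vocabulary -- no unknown tokens.
    """
    tokens = [PEN_DOWN if pen_down else PEN_UP]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        while (dx, dy) != (0, 0):
            step = (max(-1, min(1, dx)), max(-1, min(1, dy)))
            tokens.append(DIRECTIONS[step])
            dx -= step[0]
            dy -= step[1]
    return tokens

def bpe_merge_once(tokens):
    """One BPE step: replace the most frequent adjacent pair with a merged token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + "+" + b)  # new shortcut symbol for the pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# A short diagonal stroke followed by a horizontal run:
tokens = tokenize_stroke([(0, 0), (2, 2), (5, 2)])
print(tokens)                  # ['PEN_DOWN', 'UR', 'UR', 'R', 'R', 'R']
print(bpe_merge_once(tokens))  # ['PEN_DOWN', 'UR', 'UR', 'R+R', 'R']
```

In a real tokenizer, `bpe_merge_once` would be applied repeatedly on a training corpus, and the learned merges reused at inference time, so frequent patterns like long straight runs collapse into single tokens.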

Why This Changes Everything

The paper tested this new method against the old ones, and the results were surprising:

  • For Writing (Generation):

    • Old Way: When asked to write "Hello," the old vector method produced gibberish (70% error).
    • ScribeTokens: It wrote "Hello" almost perfectly (17% error).
    • Why: It's much easier for an AI to learn a sequence of simple steps (like a dance routine) than to predict exact floating-point numbers.
  • For Reading (Recognition):

    • Old Way: Previous token methods usually couldn't read handwriting as accurately as the old vector methods.
    • ScribeTokens: It became the first token method to beat the vector method without needing extra training.
  • The "Pre-training" Secret Sauce:

    • The authors added a special training phase where they asked the AI: "I've drawn half a word; what is the next step?"
    • Analogy: It's like teaching a child to read by having them finish sentences before showing them the answers.
    • Result: This made the AI learn 83 times faster and improved its reading ability significantly.
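The pretraining idea above, "given the drawing so far, predict the next step," is the standard next-token objective applied to ink tokens. The sketch below illustrates it with a toy bigram count model; this is purely illustrative (the paper trains a neural sequence model), and all names here are hypothetical:

```python
# Next-ink-token prediction with a toy bigram model (illustrative only).
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count how often each ink token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prefix):
    """Predict the most likely next ink token given a partial drawing."""
    followers = counts.get(prefix[-1])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Pretraining corpus: unlabeled ink, already tokenized into the vocabulary.
corpus = [
    ["PEN_DOWN", "R", "R", "UR", "R", "R"],
    ["PEN_DOWN", "R", "R", "R", "PEN_UP"],
]
model = train_bigram(corpus)
print(predict_next(model, ["PEN_DOWN", "R"]))  # -> 'R'
```

The key point is that this objective needs no labels: any pile of raw ink can be turned into "finish the drawing" exercises, which is what makes it so effective as a pretraining phase.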

The Bottom Line

ScribeTokens is like translating a complex, messy language of handwriting into a simple, universal "step-by-step" instruction manual.

  • It removes the confusion of "unknown" coordinates.
  • It shrinks the data so the AI can think faster.
  • It makes the AI much better at both reading your handwriting and generating new handwriting that looks human.

In short: Instead of telling the AI where the pen is, ScribeTokens tells the AI how to move the pen, one tiny step at a time. And it turns out, that's exactly how humans learn to write, too.