ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

The paper introduces ScribeTokens, a fixed-vocabulary tokenization method for digital ink that decomposes pen movements into unit pixel steps. It outperforms vector representations in both handwritten text generation and recognition, especially when combined with a novel next-ink-token prediction pretraining strategy.

Douglass Wang

Published 2026-03-04

Imagine you are trying to teach a robot to understand your handwriting. You have a digital pen, and every time you move it, the computer records a stream of coordinates (x, y points). The problem is: how do you translate this messy stream of numbers into something a smart AI can easily read and write?

This paper introduces a new solution called ScribeTokens. Here is the breakdown using simple analogies.

The Problem: Three Ways to Describe a Drawing

Before ScribeTokens, researchers tried three main ways to describe handwriting to a computer, and each had a major flaw:

  1. The "Continuous Stream" (Vectors):

    • The Analogy: Imagine describing a drawing by saying, "Move 0.004 inches right, then 0.003 inches up, then 0.001 inches left..."
    • The Flaw: This creates a massive, endless list of numbers. It's like trying to describe a movie by listing every single frame. It's slow, hard to train, and the AI gets confused easily.
  2. The "Pixel Map" (Existing Tokens):

    • The Analogy: Imagine breaking the drawing into a grid and saying, "Go to square (10, 5), then square (10, 6)..."
    • The Flaw: This is better, but if the AI sees a square it hasn't seen before (like a weirdly placed dot), it panics. It's like a dictionary that breaks if you use a word that isn't in it. Also, the dictionary needs to be huge to cover every possible square on a page.
  3. The "Number String" (Text Tokens):

    • The Analogy: Turning the coordinates into text like "1, 0, 2, minus 5..."
    • The Flaw: The AI might get the numbers right but put them in the wrong order, creating a "word salad" that doesn't make sense as a drawing.
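To make the three approaches concrete, here is how one short stroke might look under each representation. This is a hypothetical illustration of the general idea, not the exact formats used by prior work:

```python
# The same short stroke under the three prior representations (illustrative).
stroke = [(10, 5), (11, 5), (11, 6)]  # absolute integer pen positions

# 1. "Continuous Stream": real-valued offsets between consecutive samples.
vectors = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(stroke, stroke[1:])]
print(vectors)        # [(1, 0), (0, 1)]

# 2. "Pixel Map": one token per absolute grid cell. The vocabulary must
#    cover every possible (x, y) cell, and unseen cells are out-of-vocabulary.
grid_tokens = [f"CELL_{x}_{y}" for (x, y) in stroke]
print(grid_tokens)    # ['CELL_10_5', 'CELL_11_5', 'CELL_11_6']

# 3. "Number String": coordinates spelled out as text for a language model,
#    which can scramble digit order and produce invalid drawings.
number_string = ", ".join(f"{x} {y}" for (x, y) in stroke)
print(number_string)  # '10 5, 11 5, 11 6'
```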

The Solution: ScribeTokens (The "Step-by-Step" Approach)

The authors realized that instead of describing where the pen is, we should describe how it moves.

The Core Idea:
Imagine you are walking through a city grid. Instead of giving someone your GPS coordinates, you just give them a list of simple directions: "Step Right, Step Right, Step Up, Step Right."

ScribeTokens does exactly this:

  1. The 10-Word Vocabulary: It breaks every pen stroke down into tiny, unit steps. There are only 10 possible instructions:

    • 8 directions (Up, Down, Left, Right, and the 4 diagonals).
    • 2 states (Pen Down / Pen Up).
    • Analogy: It's like a robot that only knows 10 commands. No matter how complex the drawing is, it's just a long sentence made of these 10 words.
  2. No "Unknown" Words: Because every movement is just a combination of these 10 steps, the AI will never encounter a word it doesn't know. It's impossible to draw something the AI can't describe.

  3. Compression (The "Zip File"): Even with just 10 words, a long sentence is still long. So, they use a trick called BPE (Byte-Pair Encoding).

    • Analogy: If the AI sees "Step Right, Step Right, Step Right" often, it creates a new shortcut symbol for "Triple Step Right." This shrinks the data size significantly, making the AI faster and smarter.
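The two ideas above, a 10-token vocabulary of unit steps plus BPE compression, can be sketched in a few lines of Python. This is an illustrative approximation of the approach, not the paper's implementation; the token names and the staircase step decomposition are assumptions:

```python
from collections import Counter

# The 10-token base vocabulary: 8 unit-step directions plus 2 pen states.
# (Token names are hypothetical; y is assumed to increase upward.)
DIRECTIONS = {
    (1, 0): "R", (-1, 0): "L", (0, 1): "U", (0, -1): "D",
    (1, 1): "UR", (-1, 1): "UL", (1, -1): "DR", (-1, -1): "DL",
}
PEN_DOWN, PEN_UP = "PEN_DOWN", "PEN_UP"

def tokenize_stroke(points, pen_down=True):
    """Convert integer (x, y) points into unit-step tokens.

    Each consecutive pair of points is decomposed into unit steps along
    the 8 directions (a simple staircase approximation), so every possible
    movement maps onto the fixed vocabulary -- no unknown tokens.
    """
    tokens = [PEN_DOWN if pen_down else PEN_UP]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        while (dx, dy) != (0, 0):
            step = (max(-1, min(1, dx)), max(-1, min(1, dy)))
            tokens.append(DIRECTIONS[step])
            dx -= step[0]
            dy -= step[1]
    return tokens

def bpe_merge_once(tokens):
    """One BPE step: replace the most frequent adjacent pair with a merged token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + "+" + b)  # new shortcut symbol for the pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# A short diagonal stroke followed by a horizontal run:
tokens = tokenize_stroke([(0, 0), (2, 2), (5, 2)])
print(tokens)                  # ['PEN_DOWN', 'UR', 'UR', 'R', 'R', 'R']
print(bpe_merge_once(tokens))  # ['PEN_DOWN', 'UR', 'UR', 'R+R', 'R']
```

In a real tokenizer, `bpe_merge_once` would be applied repeatedly on a training corpus, and the learned merges reused at inference time, so frequent patterns like long straight runs collapse into single tokens.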

Why This Changes Everything

The paper tested this new method against the old ones, and the results were surprising:

  • For Writing (Generation):

    • Old Way: When asked to write "Hello," the old vector method produced gibberish (70% error).
    • ScribeTokens: It wrote "Hello" almost perfectly (17% error).
    • Why: It's much easier for an AI to learn a sequence of simple steps (like a dance routine) than to predict exact floating-point numbers.
  • For Reading (Recognition):

    • Old Way: Previous token methods usually couldn't read handwriting as accurately as the old vector methods.
    • ScribeTokens: It became the first token method to beat the vector method without needing extra training.
  • The "Pre-training" Secret Sauce:

    • The authors added a special training phase where they asked the AI: "I've drawn half a word; what is the next step?"
    • Analogy: It's like teaching a child to read by having them finish sentences before showing them the answers.
    • Result: This made the AI learn 83 times faster and improved its reading ability significantly.
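The pretraining idea above, "given the drawing so far, predict the next step," is the standard next-token objective applied to ink tokens. The sketch below illustrates it with a toy bigram count model; this is purely illustrative (the paper trains a neural sequence model), and all names here are hypothetical:

```python
# Next-ink-token prediction with a toy bigram model (illustrative only).
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count how often each ink token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prefix):
    """Predict the most likely next ink token given a partial drawing."""
    followers = counts.get(prefix[-1])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Pretraining corpus: unlabeled ink, already tokenized into the vocabulary.
corpus = [
    ["PEN_DOWN", "R", "R", "UR", "R", "R"],
    ["PEN_DOWN", "R", "R", "R", "PEN_UP"],
]
model = train_bigram(corpus)
print(predict_next(model, ["PEN_DOWN", "R"]))  # -> 'R'
```

The key point is that this objective needs no labels: any pile of raw ink can be turned into "finish the drawing" exercises, which is what makes it so effective as a pretraining phase.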

The Bottom Line

ScribeTokens is like translating a complex, messy language of handwriting into a simple, universal "step-by-step" instruction manual.

  • It removes the confusion of "unknown" coordinates.
  • It shrinks the data so the AI can think faster.
  • It makes the AI much better at both reading your handwriting and generating new handwriting that looks human.

In short: Instead of telling the AI where the pen is, ScribeTokens tells the AI how to move the pen, one tiny step at a time. And it turns out, that's exactly how humans learn to write, too.