Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding

This paper proposes a vocabulary trimming strategy for speculative decoding. It formulates draft-model vocabulary selection as a constrained optimization problem that balances token coverage against latency, and it achieves significant throughput improvements and latency reductions by tailoring the vocabulary to domain-specific workloads.

Ofir Ben Shoham

Published 2026-03-06

Imagine you are trying to write a long, complex story with a very smart, but slow, AI assistant (the Target Model). You want the story to be perfect, but the AI takes a long time to think of every single word.

To speed things up, you hire a fast, junior assistant (the Draft Model) to guess the next few words first. Then, the smart AI quickly checks if those guesses are right. If they are, the smart AI accepts them all at once, saving a huge amount of time. This is called Speculative Decoding.
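The propose-then-verify loop can be sketched in a few lines. This is a toy greedy variant, not the paper's implementation: `draft_next` and `target_next` are hypothetical stand-ins for the two models, each mapping a token sequence to its next token.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_len=20):
    """Toy sketch of greedy speculative decoding.

    `target_next` / `draft_next` are stand-in callables: sequence -> next token.
    """
    seq = list(prompt)
    while len(seq) < max_len:
        # 1. The fast draft model guesses k tokens ahead.
        guesses = []
        for _ in range(k):
            guesses.append(draft_next(seq + guesses))
        # 2. The slow target model verifies left to right,
        #    keeping the longest prefix it agrees with.
        accepted = []
        for g in guesses:
            if target_next(seq + accepted) == g:
                accepted.append(g)
            else:
                break
        # 3. Always take one token from the target so we make
        #    progress even when every guess is rejected.
        accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:max_len]
```

When the draft agrees with the target, each round advances up to k+1 tokens for a single "round" of target checking; when it disagrees, the loop still advances one target token per round, so output quality is unchanged.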

The Problem: The Junior Assistant is Too Cluttered

Here's the catch: The junior assistant is trying to guess from a dictionary of 128,000 words (the full vocabulary). Even though the junior assistant is "lightweight," looking up a word in such a massive dictionary takes time.

In fact, the paper argues that the junior assistant is actually the bottleneck. It spends so much time scanning its giant dictionary that it slows down the whole process, negating the speed benefits.

The Solution: A Customized Pocket Dictionary

The authors realized that in any specific job, you rarely use the whole dictionary.

  • If you are writing code, you mostly use programming keywords.
  • If you are doing math, you mostly use numbers and symbols.
  • You almost never use obscure words like "mashed potatoes" or "seismic settlement" in a coding task.

So, the paper proposes Vocabulary Trimming. Instead of giving the junior assistant the whole 128,000-word dictionary, we give them a customized pocket dictionary containing only the most relevant words for the job.
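A minimal way to build such a pocket dictionary is to rank token ids by how often they appear in a domain corpus and keep only the top ones. The helper below is an illustrative sketch under that assumption, not the paper's exact procedure:

```python
from collections import Counter

def trim_vocab(corpus_token_ids, keep):
    """Keep the `keep` most frequent token ids from a domain corpus.

    Returns the trimmed vocabulary and the fraction of corpus tokens
    it covers (illustrative sketch, not the paper's exact method).
    """
    counts = Counter(corpus_token_ids)
    kept = {tok for tok, _ in counts.most_common(keep)}
    covered = sum(c for tok, c in counts.items() if tok in kept)
    coverage = covered / max(1, len(corpus_token_ids))
    return kept, coverage
```

Because token frequencies are heavily skewed, a small `keep` can cover the vast majority of tokens the target model actually emits for that domain.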

How They Did It: The "Goldilocks" Search

The tricky part is finding the right size for this pocket dictionary.

  • Too small? The junior assistant can't guess the right words because the important word isn't in their pocket. The smart AI has to reject the guess, and we lose time.
  • Too big? The junior assistant is still slow because the dictionary is too heavy to carry.

The authors treated this like a balancing act. They created a formula to find the "Goldilocks" zone:

  1. Coverage: How many of the words the smart AI actually uses are in the pocket dictionary? (We want this high).
  2. Latency: How fast is the junior assistant looking up words? (We want this low).

They used a smart algorithm (called Tree-structured Parzen Estimator) to test thousands of different dictionary sizes automatically, looking for the sweet spot where the assistant is fast but still accurate enough.
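The paper searches this space with a Tree-structured Parzen Estimator (available in libraries such as Optuna); to make the tradeoff visible, the sketch below just brute-forces a few candidate sizes. The linear latency model and the weighted-sum score are assumptions for illustration, not the paper's exact objective.

```python
from collections import Counter

def best_vocab_size(corpus_token_ids, candidates, latency_weight=0.5):
    """Score candidate vocabulary sizes: coverage minus a latency penalty.

    Assumptions (not from the paper): draft latency grows roughly linearly
    with vocabulary size, and the two goals combine as a weighted sum.
    """
    ranked = Counter(corpus_token_ids).most_common()
    total = len(corpus_token_ids)
    full = max(candidates)
    best_size, best_score = None, float("-inf")
    for size in candidates:
        # Coverage: fraction of corpus tokens inside the top-`size` vocab.
        coverage = sum(c for _, c in ranked[:size]) / total
        # Latency proxy: normalized vocabulary size.
        latency = size / full
        score = coverage - latency_weight * latency
        if score > best_score:
            best_size, best_score = size, score
    return best_size
```

With a skewed token distribution, the score peaks at a mid-sized vocabulary: big enough to cover the common tokens, small enough to keep lookups cheap, which is exactly the "Goldilocks" zone the TPE search hunts for.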

The Results: A Supercharged Team

The results were impressive:

  • General Tasks: Even when the junior assistant's dictionary was shrunk by 90% (from 128,000 words down to just ~13,000), the team got 6.7% faster at writing stories, solving math problems, and writing code.
  • Specialized Tasks: When they tailored the dictionary specifically for a job (like "Named Entity Recognition" or "Function Calling"), they could shrink the dictionary even more (down to ~4,000 words). This made the team 20% faster.

The Big Picture Analogy

Think of the Target Model as a Master Chef and the Draft Model as a Prep Cook.

  • Before: The Prep Cook had to run to a massive, warehouse-sized pantry (128k words) to grab ingredients. Even though the Prep Cook is fast, running to the back of the warehouse takes too long.
  • After: The authors gave the Prep Cook a small, organized cart right next to the stove, containing only the 13,000 ingredients used 99% of the time.
  • Result: The Prep Cook grabs ingredients instantly. The Master Chef checks them quickly and approves them. The whole kitchen runs much faster, and the Master Chef doesn't even notice the Prep Cook is using a smaller list because the most common ingredients are still there.

Why This Matters

This paper shows that we don't need to make AI models bigger or more complex to make them faster. Sometimes, we just need to simplify their tools. By giving the "guessing" AI a smaller, smarter dictionary, we can make large language models significantly faster without losing quality.