TAPS: Task Aware Proposal Distributions for Speculative Sampling

This paper demonstrates that training draft models on task-specific data significantly improves speculative decoding performance for corresponding workloads, and that combining these specialized drafters at inference time via confidence-based routing and merged-tree verification outperforms naive weight averaging.

Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem

Published 2026-03-31

Imagine you are a brilliant but very slow architect (the Target Model) trying to build a skyscraper, one brick at a time. Every time you place a brick, you have to stop, think deeply, check the blueprints, and make sure it's perfect before moving to the next one. This process is accurate but incredibly slow.

To speed things up, you hire a fast, energetic intern (the Draft Model). The intern's job is to quickly guess the next 5 or 10 bricks you might want to place. You then quickly glance at the intern's guesses. If they look right, you accept them all at once and move on. If they look wrong, you discard them and do the work yourself.

This is Speculative Decoding. The goal is to get the intern to guess correctly as often as possible.
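In code, the intern-and-architect loop is the standard draft-then-verify procedure of speculative decoding. Below is a minimal, greedy-only sketch in Python: target_model and draft_model are toy stand-ins rather than the paper's models, and a real implementation verifies the drafted tokens in one batched forward pass instead of one at a time.

```python
import random

# Toy stand-ins: each maps a token sequence to the next token it would pick
# greedily. (Illustrative placeholders, not the paper's models.)
def target_model(tokens):   # the slow, accurate "architect"
    return (sum(tokens) * 31 + 7) % 50

def draft_model(tokens):    # the fast "intern"; agrees with the target most of the time
    guess = target_model(tokens)
    return guess if random.random() < 0.8 else (guess + 1) % 50

def speculative_step(tokens, k=5):
    """One round: the intern drafts k tokens, the architect keeps the good prefix."""
    # 1. The intern guesses k bricks ahead, building on its own guesses.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The architect checks the guesses and keeps the longest correct prefix.
    #    (Shown sequentially here; in practice this is a single parallel pass.)
    accepted, ctx = [], list(tokens)
    for t in draft:
        if target_model(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # 3. The architect always adds one token of its own, so the sequence
    #    grows even when every guess is rejected.
    accepted.append(target_model(ctx))
    return accepted

tokens = [1, 2, 3]
for _ in range(4):
    tokens += speculative_step(tokens)
print(tokens)  # the more often the intern is right, the more tokens per round
```

The speed win comes from step 2: the architect's expensive checks can be batched, so accepting several drafted tokens per round costs roughly the same as generating a single token the slow way.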

The Big Problem: The "Wrong Intern"

The paper TAPS asks a simple but crucial question: Does it matter what the intern studied in school?

In the past, researchers just hired interns who read a little bit of everything (news, chat logs, random facts). They assumed a "generalist" intern would be good at guessing anything.

The authors of this paper say: "No! That's like hiring a generalist to build a rocket ship."

They tested two types of interns:

  1. The Math Whiz: Trained only on math problems and logic puzzles.
  2. The Chatterbox: Trained only on casual conversations and social media chats.

The Results: Specialization Wins

When they put these interns to work:

  • The Math Whiz was amazing at building the "rocket ship" (solving math problems) but terrible at writing a casual email.
  • The Chatterbox was great at writing emails but failed miserably at the math problems.
  • The Generalist (trained on a mix of both) was okay at everything, but not great at anything.

The Lesson: If you know you are going to be doing math, hire the Math Whiz. If you are writing a story, hire the Chatterbox. The quality of the "guessing" depends entirely on how well the intern's training matches the job you are giving them.

The "Merging" Mistake

The researchers then asked: "What if we have both a Math Whiz and a Chatterbox available? Can we just mix their brains together to make one super-intern?"

They tried Weight Averaging (literally averaging the two interns' brain weights together; a code sketch follows the result below).

  • Result: Disaster. The new "Super-Intern" was confused. It tried to be a little bit of both and ended up mediocre at both. It was like mixing oil and water; you just get a messy sludge.
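For concreteness, "mixing their brains" is plain element-wise parameter averaging. Here is a minimal PyTorch sketch, assuming both drafters share the exact same architecture; the load_drafter helper in the comments is hypothetical, not an API from the paper.

```python
import torch

def average_weights(state_a, state_b, alpha=0.5):
    """Naive weight averaging: blend two drafters' parameters element-wise.
    Non-float buffers are copied unchanged from the first model."""
    return {
        name: alpha * state_a[name] + (1 - alpha) * state_b[name]
        if torch.is_floating_point(state_a[name]) else state_a[name]
        for name in state_a
    }

# Hypothetical usage with two task-specific drafters of identical architecture:
# math_drafter, chat_drafter = load_drafter("math"), load_drafter("chat")
# merged = average_weights(math_drafter.state_dict(), chat_drafter.state_dict())
# math_drafter.load_state_dict(merged)   # the "Super-Intern" that turned out mediocre
```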

The Smart Solution: The "Smart Manager"

Instead of mixing their brains, the researchers tried a Smart Manager approach (Inference-Time Composition).

Imagine you have both interns sitting at the desk. Before you ask them to guess the next bricks, the Manager looks at the current task:

  • "Oh, we are doing a math problem? Math Whiz, you take the lead!"
  • "Oh, we are writing a poem? Chatterbox, you take the lead!"

They found two ways to do this:

  1. Confidence Routing: The Manager asks, "Who feels most confident about this guess?" and picks that intern. This worked very well (a code sketch follows this list).
  2. Merged-Tree Verification (The Best Way): The Manager lets both interns shout out their guesses at the same time, but keeps their lists separate. The Architect (Target Model) then checks all the guesses from both lists at once.
    • Analogy: It's like having two different teams of detectives solving a mystery. Instead of forcing them to agree on one theory, you let them both present their theories, and you pick the one that fits the evidence best. This gave the best results of all.
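A minimal sketch of the confidence-routing idea, assuming Hugging Face-style causal language models whose outputs expose .logits; the function and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def route_by_confidence(drafters, input_ids):
    """Pick the drafter that is most confident about its own next-token guess.
    Confidence here = the peak probability of the drafter's next-token distribution."""
    best_model, best_conf = None, -1.0
    for model in drafters:
        with torch.no_grad():
            logits = model(input_ids).logits[:, -1, :]   # next-token logits
        conf = F.softmax(logits, dim=-1).max().item()    # how sure is this intern?
        if conf > best_conf:
            best_model, best_conf = model, conf
    return best_model

# The chosen drafter then proposes the next block of draft tokens,
# and the target model verifies them exactly as in ordinary speculative decoding.
```

Merged-tree verification goes one step further: rather than picking a single drafter up front, both drafters' guess lists are handed to the target model, which checks all of them in one pass and keeps whichever fits best.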

Why Entropy (Confusion) Didn't Work

The researchers also tried to use "Confusion" (Entropy) as a signal. They thought, "If the intern is confused, maybe we should switch to the other intern."

  • Result: It didn't work well. The interns were often confused even when they were right.
  • Better Signal: Confidence. If an intern says, "I am 99% sure this is the right brick," that is a much better signal to trust them than asking, "Are you confused?" (Both signals are sketched in code after this list.)
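Both signals come from the same next-token distribution, so it is easy to see what each one measures. A small sketch with illustrative names, not the paper's code:

```python
import torch
import torch.nn.functional as F

def routing_signals(logits):
    """Two candidate 'how sure is the intern?' signals from next-token logits."""
    probs = F.softmax(logits, dim=-1)
    confidence = probs.max().item()                                   # peak probability (worked well)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()    # overall "confusion" (worked poorly)
    return confidence, entropy

# A peaked distribution is confident and low-entropy; a flat one is neither.
peaked = torch.tensor([8.0, 0.1, 0.1, 0.1])
flat   = torch.tensor([1.0, 1.0, 1.0, 1.0])
print(routing_signals(peaked))   # high confidence, low entropy
print(routing_signals(flat))     # low confidence, high entropy
```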

The "Depth" Surprise

Finally, they noticed something interesting about how deep the guesses go.

  • Shallow Guesses (The first few bricks): It helps to have a broad mix of ideas.
  • Deep Guesses (The 10th or 20th brick): You really need the specialist. If you are deep into a complex math proof, the Chatterbox will get lost. The Math Whiz is the only one who can keep the chain of logic going.

Summary in One Sentence

To make AI faster, don't just train a "jack-of-all-trades" assistant; instead, train specialized experts for specific jobs, and use a smart system to pick the right expert (or let them both work) at the exact moment you need them.
