Speculative Speculative Decoding

This paper introduces Saguaro, an optimized Speculative Speculative Decoding (SSD) algorithm that accelerates large language model inference by running the draft and verification steps in parallel. It achieves up to 2x speedup over standard speculative decoding and 5x over autoregressive decoding.

Tanishq Kumar, Tri Dao, Avner May

Published 2026-03-04

Imagine you are trying to write a long story, but you have a very strict editor (the Target Model) who is incredibly smart but moves very slowly. Every time you write a word, you have to wait for the editor to check it, approve it, and then you can write the next one. This is how most AI chatbots work today: one word at a time, waiting for approval. It's accurate, but it's slow.

To speed this up, researchers invented Speculative Decoding. Here's how that works:
You hire a fast, energetic intern (the Draft Model) who is good at guessing what the editor might say. The intern quickly writes the next 5 words. Then, the slow editor checks all 5 words at once. If the intern was right, great! You saved time. If the intern went wrong somewhere, the editor keeps the words up to the first mistake, throws out the rest, and writes the correct word.
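
To make the intern-and-editor loop concrete, here is a minimal sketch of one checking step. It assumes the simplest "greedy" rule, where a guessed word is accepted only if it is exactly the word the editor would have written; real systems usually use a probabilistic acceptance rule instead. The names `verify_draft` and `toy_target` are made up for this illustration.

```python
from typing import Callable, List, Tuple

Token = str


def verify_draft(
    context: List[Token],
    draft: List[Token],
    target_next: Callable[[List[Token]], Token],
) -> Tuple[int, Token]:
    """Return (number of accepted draft tokens, the editor's own next token)."""
    accepted = 0
    for token in draft:
        # A real target model scores every position in one parallel pass;
        # this loop only reproduces the outcome of that pass.
        if target_next(context + draft[:accepted]) != token:
            break
        accepted += 1
    # The editor always contributes one word of its own: the correction
    # after a mismatch, or an extra word when every guess was accepted.
    bonus = target_next(context + draft[:accepted])
    return accepted, bonus


# Toy "editor" (an assumption for this demo): it always continues one sentence.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def toy_target(prefix: List[Token]) -> Token:
    return SENTENCE[len(prefix)]


accepted, bonus = verify_draft(
    context=["the"],
    draft=["quick", "brown", "cat", "sleeps", "all"],
    target_next=toy_target,
)
print(accepted, bonus)  # 2 fox
```

The intern got two words right, so the editor keeps those, discards the other three, and supplies "fox" itself.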

The Problem: Even with the intern, there's still a bottleneck. The intern has to wait for the editor to finish checking the current batch before they can start guessing the next batch. It's like a relay race where the next runner can't start until the previous runner crosses the finish line and hands off the baton.

Enter: Speculative Speculative Decoding (SSD)

The paper introduces a new method called Speculative Speculative Decoding (SSD), implemented in an algorithm named Saguaro.

Think of Saguaro not as a relay race, but as a high-stakes casino with a crystal ball.

1. The Crystal Ball (Predicting the Future)

In the old method, the intern waits for the editor to finish. In Saguaro, the intern doesn't wait. While the editor is still checking the current batch, the intern uses a "crystal ball" to guess what the editor is going to decide.

The editor can respond to each batch in several ways:

  1. Accept all 5 words.
  2. Accept 4 words, then pick a new 5th word.
  3. Accept 3 words, then pick a new 4th word.
    ...and so on.

The intern realizes, "I can't know for sure which one the editor will pick, but I can guess the most likely outcomes." So, the intern starts writing multiple future stories simultaneously, one for each possible outcome the editor might choose.

2. The "Pre-Prepared" Menu (The Cache)

Imagine the intern has a kitchen. Instead of waiting for the chef (the editor) to say, "Okay, I liked the first 3 words, now give me the 4th," the intern prepares three different next words on three different plates, just in case.

  • Plate A: If the editor accepts 3 words, here is the 4th.
  • Plate B: If the editor accepts 2 words, here is the 3rd.
  • Plate C: If the editor accepts 1 word, here is the 2nd.

This is called the Speculation Cache. The intern is doing the work in parallel with the editor's checking.
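
The plates can be pictured as a dictionary keyed by the editor's possible verdicts. The sketch below is only an illustration under stated assumptions: it keeps one guess per acceptance count, and the helpers `guess_bonus` and `draft_more` are hypothetical stand-ins for the draft model, not functions from the paper.

```python
from typing import Callable, Dict, List, Tuple

Token = str
Outcome = Tuple[int, Token]  # (accepted draft tokens, editor's bonus token)


def build_speculation_cache(
    context: List[Token],
    draft: List[Token],
    guess_bonus: Callable[[List[Token]], Token],
    draft_more: Callable[[List[Token], int], List[Token]],
    lookahead: int,
) -> Dict[Outcome, List[Token]]:
    """Pre-draft a continuation for each verdict the editor might return."""
    cache: Dict[Outcome, List[Token]] = {}
    for accepted in range(len(draft) + 1):
        prefix = context + draft[:accepted]
        # Guess which word the editor will add after this prefix.
        bonus = guess_bonus(prefix)
        # Draft the next batch as if that verdict had already happened.
        cache[(accepted, bonus)] = draft_more(prefix + [bonus], lookahead)
    return cache


# Toy helpers (assumptions for the demo): tokens are named by position.
def toy_guess_bonus(prefix: List[Token]) -> Token:
    return f"alt{len(prefix)}"


def toy_draft_more(prefix: List[Token], n: int) -> List[Token]:
    return [f"w{len(prefix) + i}" for i in range(n)]


cache = build_speculation_cache(
    context=["w0"],
    draft=["w1", "w2", "w3"],
    guess_bonus=toy_guess_bonus,
    draft_more=toy_draft_more,
    lookahead=2,
)
print(len(cache))           # 4
print(cache[(2, "alt3")])   # ['w4', 'w5']
```

In a real system this loop would run while the editor is still busy checking, which is the whole point of the cache.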

3. The Instant Serve (The Hit)

The moment the editor finishes checking and says, "Okay, I accepted the first 3 words, give me the 4th," the intern doesn't have to start writing. They just grab Plate A from the counter and hand it over instantly.

  • Result: Zero waiting time. The intern's work was done while the editor was working.

4. What if the Crystal Ball was Wrong? (The Fallback)

Sometimes, the editor picks a weird outcome the intern didn't guess (e.g., "I accepted 0 words!"). This is a Cache Miss.
In this case, the intern has to drop the pre-made plates and start writing the next batch from scratch, just like the old method. However, the paper shows that by being smart about which outcomes to guess (focusing on the most likely ones), the intern is right most of the time.
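
The hit and miss cases boil down to a dictionary lookup with a slow fallback. This is a minimal sketch, assuming the cache maps an (accepted count, bonus word) verdict to a pre-written next batch; `next_draft` and `draft_from_scratch` are illustrative names, not the paper's.

```python
from typing import Callable, Dict, List, Tuple

Token = str
Outcome = Tuple[int, Token]  # (accepted draft tokens, editor's bonus token)


def next_draft(
    cache: Dict[Outcome, List[Token]],
    outcome: Outcome,
    draft_from_scratch: Callable[[Outcome], List[Token]],
) -> Tuple[List[Token], bool]:
    """Return (next draft, whether it came from the cache)."""
    if outcome in cache:
        # Cache hit: the work was already done during verification.
        return cache[outcome], True
    # Cache miss: fall back to ordinary drafting, as in standard
    # speculative decoding.
    return draft_from_scratch(outcome), False


cache = {
    (3, "fox"): ["jumps", "over"],
    (2, "red"): ["fox", "jumps"],
}


def slow_redraft(outcome: Outcome) -> List[Token]:
    # Placeholder for running the draft model again after a miss.
    return ["<fresh draft>"]


print(next_draft(cache, (3, "fox"), slow_redraft))  # (['jumps', 'over'], True)
print(next_draft(cache, (0, "a"), slow_redraft))    # (['<fresh draft>'], False)
```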

The Three Secret Weapons of Saguaro

The paper identifies three tricky problems and how Saguaro solves them:

  1. The "How Many?" Problem: The intern needs to guess not just what the next word is, but how many words the editor will accept before stopping.

    • Saguaro's Fix: It estimates how likely each outcome is, finding that the editor is most likely to accept a few words, rarely all of them, and very rarely none. It builds a "fan-out" strategy, preparing more guesses for the likely outcomes and fewer for the unlikely ones. It's like a restaurant preparing 100 orders of the "Chicken Special" (popular) and only 1 order of the "Frog Legs" (rare).
  2. The "Quality vs. Speed" Trade-off: To guess the future better, the intern might need to change how it writes, which could make its guesses slightly less accurate.

    • Saguaro's Fix: It tweaks the intern's writing style slightly to make the "bonus word" (the one the editor picks after rejecting some) easier to guess. It's a delicate balance: make the intern slightly less perfect at guessing the current words, but much better at guessing the next word, so the whole system runs faster.
  3. The "Big Crowd" Problem: When you have many people asking for stories at once (large batch sizes), the chance of the crystal ball being wrong increases.

    • Saguaro's Fix: It changes its strategy based on the crowd size. If the crowd is small, it uses a slow, super-smart intern to guess. If the crowd is huge, it switches to a fast, simple intern who just guesses randomly. Why? Because with a huge crowd, even a smart guesser will get overwhelmed by errors, so it's better to have a fast backup that doesn't stall the whole line.
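
Two of the ideas above are easy to play with in code. The first function splits a fixed budget of cached guesses across outcomes in proportion to how likely each one is, which is one simple way to realize a "fan-out"; the paper's actual allocation rule may differ. The second shows why big crowds hurt, under the simplifying assumption that each request hits the cache independently with the same probability. All numbers are invented for illustration.

```python
from typing import List


def allocate_fanout(outcome_probs: List[float], budget: int) -> List[int]:
    """Split `budget` guesses across outcomes, proportional to probability."""
    total = sum(outcome_probs)
    shares = [p / total * budget for p in outcome_probs]
    counts = [int(s) for s in shares]  # round every share down first
    leftover = budget - sum(counts)
    # Hand the remaining guesses to the largest fractional remainders.
    by_remainder = sorted(
        range(len(shares)),
        key=lambda i: shares[i] - counts[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def all_hit_probability(per_request_hit: float, batch_size: int) -> float:
    """Chance that every request in a batch hits, assuming independence."""
    return per_request_hit ** batch_size


# Made-up chances that the editor accepts 0, 1, 2, 3, 4, or 5 of 5 words.
probs = [0.05, 0.15, 0.35, 0.25, 0.15, 0.05]
print(allocate_fanout(probs, 20))               # [1, 3, 7, 5, 3, 1]
print(round(all_hit_probability(0.9, 1), 3))    # 0.9
print(round(all_hit_probability(0.9, 32), 3))   # 0.034
```

With a 90% hit rate per request, a single request almost always hits, but a batch of 32 almost never has everyone hit at once, which is why the strategy has to change with crowd size.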

The Result

By running the "guessing" and "checking" at the same time on different computers, Saguaro eliminates the waiting time.

  • Old Way: 1x speed.
  • Standard Speculative Decoding: ~1.5x speed.
  • Saguaro (SSD): Up to 2x faster than standard speculative decoding and 5x faster than the old "wait-for-every-word" method.
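
A back-of-the-envelope model shows where the savings come from. It is a toy under stated assumptions, not the paper's analysis: drafting takes no longer than verification, a cache hit hides the drafting time completely, a miss pays for drafting again, and every round yields the same number of words. The timings and hit rate below are invented.

```python
def round_time_standard(t_draft: float, t_verify: float) -> float:
    """Standard speculative decoding: draft, then verify, one after the other."""
    return t_draft + t_verify


def round_time_overlapped(t_draft: float, t_verify: float, hit_rate: float) -> float:
    """Overlapped drafting: a hit costs only the verify time,
    a miss costs verify plus a fresh draft."""
    return t_verify + (1.0 - hit_rate) * t_draft


standard = round_time_standard(t_draft=10.0, t_verify=20.0)
overlapped = round_time_overlapped(t_draft=10.0, t_verify=20.0, hit_rate=0.9)
print(round(standard / overlapped, 2))  # 1.43
```

In this toy the gain grows as the hit rate rises and as drafting takes up a larger share of each round, which is why guessing the editor's verdict well matters so much.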

In a nutshell: Saguaro is like a chef who doesn't wait for the customer to order the dessert before starting to bake it. Instead, the chef bakes three different desserts simultaneously while the customer is still eating the main course. When the customer finally says, "I'll have the chocolate cake," the chef just slides it onto the table instantly. No waiting, just pure speed.
