ConFu: Contemplate the Future for Better Speculative Sampling

This paper introduces ConFu, a speculative decoding framework that improves draft-model accuracy by letting the draft anticipate the target's future via contemplate tokens and soft prompts, yielding an 8–11% improvement in token acceptance rate and generation speed over state-of-the-art methods such as EAGLE-3.

Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun

Published Wed, 11 Ma

Imagine you are trying to write a long story with a very smart, but very slow, friend (let's call him The Target). The Target is an expert writer, but he thinks very carefully about every single word before he says it. If you ask him to write a whole paragraph, he has to stop and think for a long time for each word. This makes the process incredibly slow.

To speed things up, you decide to hire a quick, energetic assistant (let's call him The Draft). The Draft's job is to guess the next few words of the story and write them down quickly. Then, The Target just glances at what The Draft wrote. If The Draft guessed right, The Target says, "Great, keep going!" and you save a ton of time. If The Draft guessed wrong, The Target has to cross it out and write the correct word himself.

This is called Speculative Decoding. The problem is that The Draft is a bit of a daydreamer. He only looks at the last word you wrote and guesses the next one based on that. But as he keeps guessing, he starts to drift off course. He might guess a word that makes sense locally but doesn't fit the overall story The Target is trying to tell. Eventually, The Target has to reject most of The Draft's guesses, and the speedup disappears.
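The draft-then-verify loop above can be sketched with toy models. This is a minimal, self-contained illustration of standard speculative sampling, not the paper's implementation: `target_probs` and `draft_probs` are made-up stand-ins for the two language models, and on rejection we simply resample from the target (the exact method resamples from an adjusted distribution).

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "mat", "rug"]

def target_probs(context):
    # Toy stand-in for the slow, accurate model: favors "rug" after "the".
    if context and context[-1] == "the":
        return {"cat": 0.1, "sat": 0.1, "mat": 0.2, "rug": 0.5, "the": 0.1}
    return {w: 1 / len(VOCAB) for w in VOCAB}

def draft_probs(context):
    # Toy stand-in for the fast draft model: favors "mat" after "the",
    # so it sometimes drifts away from what the target would write.
    if context and context[-1] == "the":
        return {"cat": 0.1, "sat": 0.1, "mat": 0.5, "rug": 0.2, "the": 0.1}
    return {w: 1 / len(VOCAB) for w in VOCAB}

def sample(probs):
    r, acc = random.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r < acc:
            return tok
    return tok  # numerical fallback

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then let the target verify them."""
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = sample(draft_probs(ctx))
        draft.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in draft:
        p_t = target_probs(ctx)[tok]
        p_d = draft_probs(ctx)[tok]
        if random.random() < min(1.0, p_t / p_d):  # standard accept rule
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejection: the target writes the word itself (simplified),
            # and the rest of the draft's guesses are thrown away.
            accepted.append(sample(target_probs(ctx)))
            break
    return accepted
```

Every accepted token is one the target never had to generate from scratch, which is where the speedup comes from; every rejection wastes the rest of the drafted run.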

Enter ConFu (Contemplate the Future)

The authors of this paper realized that The Draft needs to stop just looking at the last word and start thinking ahead. They wanted The Draft to know what The Target is planning to say next, not just what The Target said last.

Here is how ConFu works, using a simple analogy:

1. The "Pause Button" and the "Whisper"

Imagine The Target is about to write a sentence. Before he actually writes the next word, he hits a special Pause Button (called a Contemplate Token).

When he hits this button, he doesn't just stop; he secretly whispers his "thought process" into a special note. This note isn't a full sentence; it's more like a vibe or a direction, like a mental map saying, "I'm about to explain a math problem," or "I'm about to tell a sad story."

This note is called a Soft Prompt. It's a tiny, cheap piece of information that tells The Draft, "Hey, I'm thinking about going in this direction."

2. The Draft Gets the Map

Now, The Draft doesn't just guess blindly. He gets to read The Target's secret note before he starts writing.

  • Without ConFu: The Draft guesses, "The cat sat on the..." and guesses "mat."
  • With ConFu: The Target whispers, "I'm thinking about a cozy living room." The Draft hears this and thinks, "Oh! He's thinking cozy! I'll guess 'rug' instead of 'mat'."

Because The Draft is now aligned with The Target's future plan, his guesses are much more likely to be correct.
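The "cat on the mat/rug" example can be made concrete with a toy sketch. Everything here is hypothetical: in the real system the soft prompt is a learned hidden-state vector from the target, not a hand-written bias over words, but the effect is the same, the hint tilts the draft's distribution toward the target's plan.

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def draft_logits(context, soft_prompt=None):
    # Toy draft scores for the word after "The cat sat on the...".
    logits = {"mat": 2.0, "rug": 1.0, "floor": 0.5}
    # Hypothetical soft prompt: modeled here as an additive bias,
    # standing in for the target's whispered hidden state.
    if soft_prompt is not None:
        for tok, bias in soft_prompt.items():
            logits[tok] = logits.get(tok, 0.0) + bias
    return logits

cozy_hint = {"rug": 2.5}  # "I'm thinking about a cozy living room"
ctx = ["The", "cat", "sat", "on", "the"]
without = softmax(draft_logits(ctx))
with_hint = softmax(draft_logits(ctx, cozy_hint))
```

Without the hint the draft's top guess is "mat"; with the hint it flips to "rug", matching what the target was planning all along.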

3. The "Chameleon" Assistant (MoE)

Here is the tricky part: The Target's "thoughts" change depending on what he's doing. Sometimes he's writing code, sometimes he's telling a joke, and sometimes he's solving a math problem. A single "whisper" isn't enough for all these different situations.

So, ConFu gives The Draft a Chameleon Assistant (using a technology called Mixture-of-Experts).

  • If the story is about math, the Chameleon switches to "Math Mode" and interprets the whisper as "solve for X."
  • If the story is about cooking, it switches to "Recipe Mode" and interprets the whisper as "add salt."

This allows The Draft to understand the context perfectly, no matter what The Target is doing.
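The Chameleon idea can be sketched as a tiny router over experts. This is only an illustration of the Mixture-of-Experts pattern: a real MoE uses a learned softmax gate over neural experts, whereas here the gate is keyword matching and the experts are plain functions.

```python
def math_expert(hint):
    # Interprets the whisper in "Math Mode".
    return f"solve: {hint}"

def recipe_expert(hint):
    # Interprets the whisper in "Recipe Mode".
    return f"season: {hint}"

EXPERTS = {"math": math_expert, "cooking": recipe_expert}

def router(context):
    # Toy gate: keyword match instead of a learned gating network.
    if any(w in context for w in ("equation", "x", "sum")):
        return "math"
    return "cooking"

def interpret(context, hint):
    # Route the same whisper to the expert that fits the current context.
    return EXPERTS[router(context)](hint)
```

The same whisper gets a different interpretation depending on which expert the gate selects, which is exactly what lets one small assistant cover code, jokes, and math.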

4. The "Practice Run" (Training)

To teach The Draft how to use these whispers, the researchers didn't just tell him to guess. They made him practice.

  • They took a long story and randomly picked a few spots (Anchor Tokens) to practice on.
  • They made The Draft guess the future based on the whisper from that spot.
  • Crucially, they told The Draft: "If you get the whisper right for this spot, you should be able to get the guesses right for the words right next to it too."

This made The Draft very robust. He learned that the "future direction" doesn't change instantly; it stays consistent for a little while, helping him make better guesses even when the exact spot shifts slightly.
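The practice scheme above can be sketched as building training pairs. This is a guess at the data-construction step, not the paper's code: anchor positions are sampled at random, and each anchor's (stand-in) soft prompt is paired with the tokens in a small window after it, so the draft learns that the hint stays valid for nearby positions too.

```python
import random

random.seed(0)

def make_training_pairs(tokens, num_anchors=2, window=2):
    """Sample anchor positions and pair each anchor's hint with the
    next `window` tokens, teaching robustness to small position shifts."""
    anchors = random.sample(range(len(tokens) - window), num_anchors)
    pairs = []
    for a in sorted(anchors):
        hint_id = a  # stand-in for the soft prompt computed at position a
        for offset in range(1, window + 1):
            # (which hint, which position to predict, the correct token)
            pairs.append((hint_id, a + offset, tokens[a + offset]))
    return pairs

story = list("abcdefgh")  # toy "long story" of 8 tokens
pairs = make_training_pairs(story)
```

Each anchor contributes several targets, which is what teaches the draft that one whisper covers a stretch of text rather than a single word.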

The Result: A Faster Storyteller

When they tested this new system (ConFu) against the current best method (EAGLE-3), the results were impressive:

  • More Accepted Guesses: The Target accepted The Draft's guesses about 8–11% more often.
  • Faster Speed: Because fewer guesses were rejected, the whole story was written much faster.
  • Better Alignment: The Draft stayed on the "semantic track" of the story much longer, preventing those annoying moments where the story goes off the rails.

Why This Matters

Think of it like a GPS.

  • Old Way: The GPS only tells you the next turn. If you miss it, you're lost and have to recalculate the whole route.
  • ConFu Way: The GPS tells you the next turn and whispers, "In 5 miles, we're going to a beach resort." Because you know the destination, you can anticipate the turns better, stay on the right road, and get there faster without getting lost.

ConFu is a clever trick that lets a fast, simple AI assistant "read the mind" of a slow, smart AI master. By giving the assistant a glimpse of the future, they work together much more efficiently, making AI faster and cheaper to run without losing any of its intelligence.