ConFu: Contemplate the Future for Better Speculative Sampling

This paper introduces ConFu, a speculative decoding framework that improves draft-model accuracy by letting the draft anticipate the target's future via contemplate tokens and soft prompts, yielding an 8–11% improvement in token acceptance rate and generation speed over state-of-the-art methods such as EAGLE-3.

Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun

Published Wed, 11 Ma

Imagine you are trying to write a long story with a very smart, but very slow, friend (let's call him The Target). The Target is an expert writer, but he thinks very carefully about every single word before he says it. If you ask him to write a whole paragraph, he has to stop and think for a long time for each word. This makes the process incredibly slow.

To speed things up, you decide to hire a quick, energetic assistant (let's call him The Draft). The Draft's job is to guess the next few words of the story and write them down quickly. Then, The Target just glances at what The Draft wrote. If The Draft guessed right, The Target says, "Great, keep going!" and you save a ton of time. If The Draft guessed wrong, The Target has to cross it out and write the correct word himself.

This is called Speculative Decoding. The problem is that The Draft is a bit of a daydreamer. He only looks at the last word you wrote and guesses the next one based on that. But as he keeps guessing, he starts to drift off course. He might guess a word that makes sense locally but doesn't fit the overall story The Target is trying to tell. Eventually, The Target has to reject most of The Draft's guesses, and the speedup disappears.
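The draft-then-verify loop above can be sketched with toy models. This is a minimal, self-contained illustration of standard speculative sampling, not the paper's implementation: `target_probs` and `draft_probs` are made-up stand-ins for the two language models, and on rejection we simply resample from the target (the exact method resamples from an adjusted distribution).

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "mat", "rug"]

def target_probs(context):
    # Toy stand-in for the slow, accurate model: favors "rug" after "the".
    if context and context[-1] == "the":
        return {"cat": 0.1, "sat": 0.1, "mat": 0.2, "rug": 0.5, "the": 0.1}
    return {w: 1 / len(VOCAB) for w in VOCAB}

def draft_probs(context):
    # Toy stand-in for the fast draft model: favors "mat" after "the",
    # so it sometimes drifts away from what the target would write.
    if context and context[-1] == "the":
        return {"cat": 0.1, "sat": 0.1, "mat": 0.5, "rug": 0.2, "the": 0.1}
    return {w: 1 / len(VOCAB) for w in VOCAB}

def sample(probs):
    r, acc = random.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r < acc:
            return tok
    return tok  # numerical fallback

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then let the target verify them."""
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = sample(draft_probs(ctx))
        draft.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in draft:
        p_t = target_probs(ctx)[tok]
        p_d = draft_probs(ctx)[tok]
        if random.random() < min(1.0, p_t / p_d):  # standard accept rule
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejection: the target writes the word itself (simplified),
            # and the rest of the draft's guesses are thrown away.
            accepted.append(sample(target_probs(ctx)))
            break
    return accepted
```

Every accepted token is one the target never had to generate from scratch, which is where the speedup comes from; every rejection wastes the rest of the drafted run.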

Enter ConFu (Contemplate the Future)

The authors of this paper realized that The Draft needs to stop just looking at the last word and start thinking ahead. They wanted The Draft to know what The Target is planning to say next, not just what The Target said last.

Here is how ConFu works, using a simple analogy:

1. The "Pause Button" and the "Whisper"

Imagine The Target is about to write a sentence. Before he actually writes the next word, he hits a special Pause Button (called a Contemplate Token).

When he hits this button, he doesn't just stop; he secretly whispers his "thought process" into a special note. This note isn't a full sentence; it's more like a vibe or a direction, like a mental map saying, "I'm about to explain a math problem," or "I'm about to tell a sad story."

This note is called a Soft Prompt. It's a tiny, cheap piece of information that tells The Draft, "Hey, I'm thinking about going in this direction."

2. The Draft Gets the Map

Now, The Draft doesn't just guess blindly. He gets to read The Target's secret note before he starts writing.

  • Without ConFu: The Draft guesses, "The cat sat on the..." and guesses "mat."
  • With ConFu: The Target whispers, "I'm thinking about a cozy living room." The Draft hears this and thinks, "Oh! He's thinking cozy! I'll guess 'rug' instead of 'mat'."

Because The Draft is now aligned with The Target's future plan, his guesses are much more likely to be correct.
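The "cat on the mat/rug" example can be made concrete with a toy sketch. Everything here is hypothetical: in the real system the soft prompt is a learned hidden-state vector from the target, not a hand-written bias over words, but the effect is the same, the hint tilts the draft's distribution toward the target's plan.

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def draft_logits(context, soft_prompt=None):
    # Toy draft scores for the word after "The cat sat on the...".
    logits = {"mat": 2.0, "rug": 1.0, "floor": 0.5}
    # Hypothetical soft prompt: modeled here as an additive bias,
    # standing in for the target's whispered hidden state.
    if soft_prompt is not None:
        for tok, bias in soft_prompt.items():
            logits[tok] = logits.get(tok, 0.0) + bias
    return logits

cozy_hint = {"rug": 2.5}  # "I'm thinking about a cozy living room"
ctx = ["The", "cat", "sat", "on", "the"]
without = softmax(draft_logits(ctx))
with_hint = softmax(draft_logits(ctx, cozy_hint))
```

Without the hint the draft's top guess is "mat"; with the hint it flips to "rug", matching what the target was planning all along.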

3. The "Chameleon" Assistant (MoE)

Here is the tricky part: The Target's "thoughts" change depending on what he's doing. Sometimes he's writing code, sometimes he's telling a joke, and sometimes he's solving a math problem. A single "whisper" isn't enough for all these different situations.

So, ConFu gives The Draft a Chameleon Assistant (using a technology called Mixture-of-Experts).

  • If the story is about math, the Chameleon switches to "Math Mode" and interprets the whisper as "solve for X."
  • If the story is about cooking, it switches to "Recipe Mode" and interprets the whisper as "add salt."

This allows The Draft to understand the context perfectly, no matter what The Target is doing.
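The Chameleon idea can be sketched as a tiny router over experts. This is only an illustration of the Mixture-of-Experts pattern: a real MoE uses a learned softmax gate over neural experts, whereas here the gate is keyword matching and the experts are plain functions.

```python
def math_expert(hint):
    # Interprets the whisper in "Math Mode".
    return f"solve: {hint}"

def recipe_expert(hint):
    # Interprets the whisper in "Recipe Mode".
    return f"season: {hint}"

EXPERTS = {"math": math_expert, "cooking": recipe_expert}

def router(context):
    # Toy gate: keyword match instead of a learned gating network.
    if any(w in context for w in ("equation", "x", "sum")):
        return "math"
    return "cooking"

def interpret(context, hint):
    # Route the same whisper to the expert that fits the current context.
    return EXPERTS[router(context)](hint)
```

The same whisper gets a different interpretation depending on which expert the gate selects, which is exactly what lets one small assistant cover code, jokes, and math.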

4. The "Practice Run" (Training)

To teach The Draft how to use these whispers, the researchers didn't just tell him to guess. They made him practice.

  • They took a long story and randomly picked a few spots (Anchor Tokens) to practice on.
  • They made The Draft guess the future based on the whisper from that spot.
  • Crucially, they told The Draft: "If you get the whisper right for this spot, you should be able to get the guesses right for the words right next to it too."

This made The Draft very robust. He learned that the "future direction" doesn't change instantly; it stays consistent for a little while, helping him make better guesses even when the exact spot shifts slightly.
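The practice scheme above can be sketched as building training pairs. This is a guess at the data-construction step, not the paper's code: anchor positions are sampled at random, and each anchor's (stand-in) soft prompt is paired with the tokens in a small window after it, so the draft learns that the hint stays valid for nearby positions too.

```python
import random

random.seed(0)

def make_training_pairs(tokens, num_anchors=2, window=2):
    """Sample anchor positions and pair each anchor's hint with the
    next `window` tokens, teaching robustness to small position shifts."""
    anchors = random.sample(range(len(tokens) - window), num_anchors)
    pairs = []
    for a in sorted(anchors):
        hint_id = a  # stand-in for the soft prompt computed at position a
        for offset in range(1, window + 1):
            # (which hint, which position to predict, the correct token)
            pairs.append((hint_id, a + offset, tokens[a + offset]))
    return pairs

story = list("abcdefgh")  # toy "long story" of 8 tokens
pairs = make_training_pairs(story)
```

Each anchor contributes several targets, which is what teaches the draft that one whisper covers a stretch of text rather than a single word.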

The Result: A Faster Storyteller

When they tested this new system (ConFu) against the current best method (EAGLE-3), the results were impressive:

  • More Accepted Guesses: The Target accepted The Draft's guesses about 8–11% more often.
  • Faster Speed: Because fewer guesses were rejected, the whole story was written much faster.
  • Better Alignment: The Draft stayed on the "semantic track" of the story much longer, preventing those annoying moments where the story goes off the rails.

Why This Matters

Think of it like a GPS.

  • Old Way: The GPS only tells you the next turn. If you miss it, you're lost and have to recalculate the whole route.
  • ConFu Way: The GPS tells you the next turn and whispers, "In 5 miles, we're going to a beach resort." Because you know the destination, you can anticipate the turns better, stay on the right road, and get there faster without getting lost.

ConFu is a clever trick that lets a fast, simple AI assistant "read the mind" of a slow, smart AI master. By giving the assistant a glimpse of the future, they work together much more efficiently, making AI faster and cheaper to run without losing any of its intelligence.