The Big Problem: The "Speeding Car" That Loses Its Way

Imagine you are trying to write a very long story (like a novel) with a brilliant but slow-thinking author (the Target Model). To save time, you hire a fast, energetic intern (the Draft Model) to guess the next few sentences before the author writes them.

In the world of AI, this is called Speculative Decoding. The intern guesses a few sentences, and the author quickly checks them. If the intern is right, the author just says "Good job!" and moves on, skipping the hard work of writing those words from scratch. If the intern is wrong, the author keeps the words that were right, fixes the first mistake, and the intern starts guessing again from that point.

The Catch:
The paper discovered a major flaw in how these "interns" are trained.

The Training: The interns are trained on short texts (like tweets or short emails). They are great at guessing the next word in a short passage.
The Reality: In the real world, people ask AI to write long reports, code, or stories that are thousands of words long.

As the story gets longer, the intern starts to get confused. Because they were only trained on short texts, they lose their "train of thought" as the text grows. They start guessing words that don't fit the long context.

The Result: The author has to reject almost all of the intern's guesses. Instead of saving time, the process slows down because the author is constantly stopping to correct the intern. The paper tracks this with the "acceptance length" (roughly, how many of the intern's words the author keeps per round), which drops to nearly 1, meaning the intern is basically useless.

The Solution: "Test-Time Speculation" (TTS)

The authors propose a clever fix called Test-Time Speculation (TTS). Instead of hiring a new intern for every job, they teach the same intern how to adapt while they are working.

The Analogy: The Live Coaching Session
Imagine the intern is writing the story, and the author is checking it.

Old Way: The intern guesses 10 words. The author checks them. If they are wrong, the author fixes them and moves on. The intern learns nothing from the mistake because they are never told why they were wrong in a way that helps them for the next sentence.
The TTS Way: Every time the author checks the intern's work, the author doesn't just say "Right" or "Wrong." The author uses that moment to give the intern a mini-lesson.
- The author says, "You guessed 'cat', but in this specific long story, the word should be 'dog'. Here is the exact probability distribution I used."
- The intern immediately updates their brain (their internal math) based on this specific lesson.
- Now, when the intern guesses the next set of words, they are slightly smarter and better aligned with the author's current mood and the story's long history.

Why is this special?
Usually, you have to stop and retrain a model for days to make it better. TTS does this instantly while the story is being written. It uses the "verification" step (which the author has to do anyway) as a free training signal. It's like a student learning a new language by having a conversation with a teacher, where the teacher corrects them in real-time, making them fluent by the end of the conversation.

The Results: Getting Faster the Longer You Go

The paper tested this on several different types of "authors" (AI models) and "interns" (speculators) across difficult tasks like solving math problems, writing code, and answering science questions.

The Improvement: By using TTS, the "interns" became much better at guessing the right words as the story got longer.
The Numbers: On average, the number of guesses the author accepted per round rose by 41% compared with the strongest previous method. In the best case, the gain was as high as 72%.
The Trend: The longer the text gets, the better TTS works. While other methods fail after a few thousand words, TTS actually gets more accurate as the generation continues because the intern keeps learning and adapting on the fly.

Summary

Think of previous methods as hiring a fast runner who is only good for a 100-meter sprint. When you ask them to run a marathon, they collapse.

Test-Time Speculation is like giving that runner a coach who runs alongside them, whispering corrections and strategy adjustments every single step of the way. The runner tires less, stays on the right path, and the whole team finishes the marathon much faster.

The paper shows that by letting the AI "learn on the job" during the generation process, we can keep AI fast and efficient, even when writing very long documents.

Technical Summary: Test-Time Speculation (TTS)

1. Problem Statement

The paper identifies a critical limitation in current state-of-the-art speculative decoding methods (such as DFlash, EAGLE-3, and PARD) when applied to long-response tasks. While speculative decoding accelerates Large Language Model (LLM) inference by using a fast "draft" model to generate tokens and a slower "target" model to verify them, its efficiency relies heavily on the acceptance length—the number of consecutive draft tokens accepted by the target model per round.
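As a rough illustration of how an acceptance length arises, the sketch below applies the standard speculative-sampling acceptance rule (keep a drafted token with probability min(1, p/q)) and counts accepted tokens until the first rejection. The function name, the list-based distributions, and the toy values are assumptions made for this example, not code from the paper. The step that resamples a corrected token after a rejection is left out.

```python
import random


def acceptance_length(draft_tokens, draft_probs, target_probs, rng):
    """Count consecutive drafted tokens accepted by the target.

    draft_tokens : token ids proposed by the draft model, one per position
    draft_probs  : draft distribution q at each position (list of lists)
    target_probs : target distribution p at each position (list of lists)
    rng          : a random.Random instance, passed in for reproducibility
    """
    accepted = 0
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # q[tok] > 0 because the token was sampled from q.
        ratio = p[tok] / q[tok]
        if rng.random() < min(1.0, ratio):
            accepted += 1
        else:
            break  # the first rejection ends the round
    return accepted


# Toy usage: the draft matches the target exactly, so every ratio is 1
# and all three drafted tokens are accepted.
rng = random.Random(0)
q = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
print(acceptance_length([0, 1, 0], q, q, rng))  # 3
```

The further the draft distribution drifts from the target's, the smaller the ratios become and the earlier the loop stops, which is the degradation described in this section.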

The authors observe that acceptance lengths for existing speculators degrade significantly as the generation length increases. Within just a few thousand output tokens, acceptance lengths often drop to values close to 1 (e.g., 1.1), effectively eliminating any speedup benefits. This degradation occurs because state-of-the-art speculators are trained offline on short sequences (typically $\le$ 2K tokens), creating a distribution mismatch when they are forced to approximate the target model on much longer sequences (e.g., 20K–32K tokens) during inference. As the generation proceeds, the draft model's predictions diverge from the target's increasingly confident distribution, leading to frequent rejections.
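To see why an acceptance length near 1 removes the benefit, consider a deliberately simple cost model. It is an assumption made here for illustration and is not taken from the paper: plain decoding emits one token per target forward pass, while a speculative round costs one verification pass plus some drafting overhead and emits a number of tokens equal to the acceptance length. The 20% overhead below is a hypothetical value.

```python
def estimated_speedup(tokens_per_round: float, draft_overhead: float) -> float:
    """Speedup over plain decoding under the assumed cost model.

    Plain decoding: one token per target forward pass (cost 1).
    Speculative round: one verification pass (cost 1) plus drafting
    (cost `draft_overhead`, as a fraction of a target pass), yielding
    `tokens_per_round` tokens.
    """
    return tokens_per_round / (1.0 + draft_overhead)


# Hypothetical drafting overhead of 20% of a target forward pass.
for tau in (1.1, 2.0, 4.0):
    print(f"acceptance length {tau}: ~{estimated_speedup(tau, 0.2):.2f}x")
```

Under these assumptions an acceptance length of 1.1 lands below 1x, meaning speculation is slower than not speculating at all, while an acceptance length of 4 still yields a clear gain.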

2. Methodology: Test-Time Speculation (TTS)

To address this distribution mismatch, the authors propose Test-Time Speculation (TTS), an online distillation approach that adapts the draft model in real-time during the inference process.

Core Insight

The key realization is that the standard verification step of speculative decoding already produces the supervision signal needed for adaptation, at no additional cost. In every round, the target model computes its full probability distribution at each drafted position. TTS leverages this by treating:

The Target Model as the "Teacher."
The Draft Model as the "Student."
The Verified Draft Tokens as the distillation training sample.

Algorithm

TTS interleaves generation with model updates. The process for each speculation round is as follows:

Drafting: The current draft model ($q_t$) generates a canvas of $C$ tokens.
Verification: The target model ($p$) evaluates the canvas in a single forward pass, determining the acceptance length ($\tau$) via standard rejection sampling.
Distillation Loss: Before the next round, the draft model is updated using a single gradient step on a distillation loss function:
$L_t(q) = \tilde{KL}(p \parallel q) + \lambda \tilde{KL}(q_t \parallel q)$
- The first term approximates the Kullback-Leibler (KL) divergence between the target's distribution and the new draft distribution over the canvas.
- The second term is a regularization component preventing the draft from drifting too far from its previous state ($q_t$); $\lambda$ sets its strength.
- Position-dependent weights ($w_k$) are applied, prioritizing earlier tokens in the canvas.
Update: The draft model parameters are updated ($q_{t+\tau} \leftarrow q_t - \eta \nabla L_t$), where $\eta$ is the learning rate and $q$ stands in for the draft model's parameters.
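The loss and update above can be sketched on a toy "draft model" that is nothing more than a table of logits, one row per canvas position. Everything in this block is an assumption for illustration: the names `tts_loss` and `tts_step`, the exact (non-approximated) KL terms, the way the weights $w_k$ multiply both terms, and all numeric values. A real draft model would be a neural network updated by automatic differentiation.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(a, b):
    """KL(a || b) for two discrete distributions with b > 0 everywhere."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


def tts_loss(logits, target, q_ref, weights, lam):
    """Weighted KL(p || q) + lam * KL(q_ref || q), summed over positions."""
    loss = 0.0
    for z, p, qr, w in zip(logits, target, q_ref, weights):
        q = softmax(z)
        loss += w * (kl(p, q) + lam * kl(qr, q))
    return loss


def tts_step(logits, target, q_ref, weights, lam, lr):
    """One gradient step on tts_loss with respect to the logits.

    For a softmax, d KL(p || q) / d logits = q - p, so the gradient is
    w * ((q - p) + lam * (q - q_ref)). When the logits still reproduce
    q_ref, the regularizer's gradient is zero and the step is driven by
    the target term alone.
    """
    new_logits = []
    for z, p, qr, w in zip(logits, target, q_ref, weights):
        q = softmax(z)
        grad = [w * ((qi - pi) + lam * (qi - qri))
                for qi, pi, qri in zip(q, p, qr)]
        new_logits.append([zi - lr * gi for zi, gi in zip(z, grad)])
    return new_logits


# Toy canvas of two positions over a three-token vocabulary.
logits = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
target = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]   # target distributions p
q_t = [softmax(z) for z in logits]            # draft state before the update
weights = [1.0, 0.5]                          # earlier position weighted more

before = tts_loss(logits, target, q_t, weights, lam=0.1)
updated = tts_step(logits, target, q_t, weights, lam=0.1, lr=0.5)
after = tts_loss(updated, target, q_t, weights, lam=0.1)
print(f"loss before: {before:.4f}, after one step: {after:.4f}")
```

In this toy setting a single step lowers the loss, moving the draft's distributions toward the target's while the regularizer penalizes any move away from $q_t$.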

System Optimizations

To manage the trade-off between improved acceptance length and the latency overhead of gradient updates, TTS employs:

Strided Updates: Gradient updates are performed every $S$ rounds rather than every round, amortizing the computational cost.
Asynchronous Pipelining: Updates are offloaded to a dedicated CUDA stream that runs in parallel with the subsequent $S-1$ generation rounds, hiding the latency from the critical path.
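The effect of these two optimizations on latency can be approximated with a back-of-the-envelope model. The model, the function name, and the timings are assumptions made here and not measurements from the paper: one update is launched every $S$ rounds, and with pipelining it overlaps perfectly with the following $S-1$ rounds, so only the part that does not fit in that window stays on the critical path.

```python
def exposed_update_cost(update_ms, round_ms, stride, pipelined):
    """Average update latency per round left on the critical path.

    update_ms : time for one gradient update (hypothetical value)
    round_ms  : time for one draft-and-verify round (hypothetical value)
    stride    : S, the number of rounds between updates
    pipelined : whether the update overlaps the next S-1 rounds
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if pipelined:
        hidden_budget = (stride - 1) * round_ms
        exposed = max(0.0, update_ms - hidden_budget)
    else:
        exposed = update_ms
    return exposed / stride


# Hypothetical timings: a 40 ms update and 12 ms generation rounds.
for s in (1, 5):
    sync = exposed_update_cost(40, 12, s, pipelined=False)
    overlap = exposed_update_cost(40, 12, s, pipelined=True)
    print(f"S={s}: {sync:.1f} ms/round synchronous, {overlap:.1f} ms/round pipelined")
```

With these made-up numbers, striding alone cuts the per-round overhead from 40 ms to 8 ms, and pipelining hides the remainder because the update fits inside the four rounds that follow it. The model ignores contention between the update stream and generation on the same GPU.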

3. Key Contributions

Diagnosis of Degradation: The authors demonstrate that the efficacy of current speculators degrades with generation length due to a mismatch between the draft's short-sequence training distribution and the long-sequence inference distribution.
TTS Framework: They propose Test-Time Speculation, an online distillation method that utilizes the verification step as a supervision signal to adapt the draft model at inference time, requiring no offline retraining.
Comprehensive Evaluation: The method is evaluated across five state-of-the-art models from the Qwen-3, Qwen-3.5, and Llama 3.1 families and eight diverse benchmarks (including AIME, LiveCodeBench, and GPQA), showing consistent improvements.
System Integration: The authors implement TTS within the SGLang inference framework, addressing system-level challenges such as kernel differentiation and CUDA graph synchronization.

4. Experimental Results

Acceptance Length Improvement: TTS improves mean acceptance lengths by up to 72% (41% on average) over DFlash, and by up to 67% (34% on average) over EAGLE-3.
Scaling with Length: The benefits of TTS scale with generation length. For example, on the AIME 2024 dataset, the improvement over DFlash grows from 15% in the first 0–10K tokens to 183% in the 20–30K token range.
Throughput: While frequent updates (stride $S=1$) maximize acceptance length, a stride of $S=5$ achieves the best throughput speedup (up to $1.71\times$ over DFlash) by balancing adaptation frequency with update overhead.
Generalization: TTS is effective across different model sizes (4B to 122B) and architectures (Dense and MoE), particularly compensating for speculators trained on short contexts (e.g., EAGLE-3 with 2K context) when applied to targets with much larger context windows.
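
A note on reading the length-scaling percentages above: assuming they are relative gains in acceptance length (an interpretation, since this summary does not spell out the formula), and writing $\tau_{\text{TTS}}$ and $\tau_{\text{base}}$ as shorthand introduced here for the acceptance lengths with and without TTS, they convert to ratios as follows:

$\text{gain} = (\tau_{\text{TTS}} - \tau_{\text{base}}) / \tau_{\text{base}}, \quad 15\% \Rightarrow \tau_{\text{TTS}} = 1.15\,\tau_{\text{base}}, \quad 183\% \Rightarrow \tau_{\text{TTS}} = 2.83\,\tau_{\text{base}}$

Under that reading, a 183% improvement means nearly three times as many accepted tokens per round, not 1.83 times.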

5. Significance and Claims

The paper claims that TTS fundamentally addresses the limitation of speculative decoding in real-world, long-response scenarios. By adapting the draft model during the generation process, TTS closes the gap between training and inference distributions, ensuring that speculative decoding remains effective even for outputs spanning tens of thousands of tokens.

The authors emphasize that TTS requires no assumptions about the request stream structure (unlike prior online methods that rely on domain-specific buffers) and operates directly on top of existing, public state-of-the-art speculators. This makes TTS a practical solution for maintaining high inference throughput in production environments where long-form generation (e.g., code, reasoning, content creation) is dominant. The work is presented as a necessary evolution to keep speculative decoding viable as LLM applications shift toward longer context windows.
