Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Imagine you are a master chef (the Large Language Model or "Verifier") trying to write a complex recipe. You are incredibly talented, but you are also very slow and expensive to run because you have to think about every single word before you write it down.

To speed things up, you hire a fast, energetic apprentice (the Draft Model). The apprentice guesses the next few words of the recipe very quickly.

The Old Way: The Strict Taste-Test

In the traditional method (called Speculative Sampling), the apprentice writes down a whole paragraph of guesses. Then, the master chef stops and checks every single word against their own knowledge.

If the apprentice's guess matches the chef's perfect vision, the chef says, "Yes, keep it!"
If the chef thinks, "Hmm, I would have chosen a different word," they reject that word, throw away everything the apprentice wrote after it, and carry on writing from that point themselves.

This is safe, but it's frustrating. Sometimes the apprentice's guess is almost perfect and would taste great, but because it wasn't exactly what the chef would have said, it gets thrown away. This wastes the apprentice's speed.

The Problem with the "Loose" Fix

Recently, some people tried a "looser" approach (called Typical Acceptance Sampling, or TAS). They said, "Let's just accept the apprentice's guess if it's probably good enough, even if it's not perfect."

The Catch: This is like letting the apprentice add too much salt or weird spices just to make the cooking faster. Sometimes it works, but often the final dish tastes "off" or loses the subtle, critical flavors the chef was trying to capture. The quality of the recipe drops.

The New Solution: CACTUS (The Smart Compromise)

The authors of this paper propose CACTUS, a new way to balance speed and quality. Think of it as a Smart Quality Control System.

Instead of demanding a perfect match (too slow) or accepting anything that looks okay (too risky), CACTUS sets a strict "tolerance limit."

Here is how it works with a simple analogy:

The "Bonus" System: Imagine the apprentice suggests a word. The master chef looks at it. If the word is good, the chef gives it a tiny "bonus" to make it even more likely to be accepted, but only if it stays within a specific "flavor profile."
The Safety Net: CACTUS uses a mathematical rule (a constraint) to ensure that the final recipe never drifts too far from the master chef's original style. It's like having a GPS that says, "You can take a shortcut to save time, but you must stay within 5 miles of the main highway."
The Result: The apprentice gets to keep more of their guesses (making the process much faster), and the final dish stays within a small, controlled distance of what the master chef would have made.

Why is this a big deal?

Speed: It accepts more of the apprentice's guesses, so the computer generates text much faster (like writing a novel in half the time).
Quality: Unlike the "loose" methods that can ruin the flavor, CACTUS puts an explicit limit on how far the output may drift from the chef's own, which keeps quality high in the reported experiments.
No Training Needed: You don't need to retrain the AI models to use this. It's like giving the existing chef and apprentice a new set of rules to follow, rather than hiring new people.

In a nutshell: CACTUS is a clever trick that lets AI models "cheat" a little bit to go faster, but it puts a strict leash on the cheating so the output stays close to what the larger model would have produced. It aims for the best of both worlds: more speed with a bounded, controllable cost in fidelity.

1. Problem Statement

Auto-regressive Large Language Models (LLMs) face significant computational bottlenecks because generating each token requires a memory-bound forward pass through hundreds of billions of parameters. Speculative Sampling (SpS) addresses this by using a smaller "draft" model to propose multiple tokens, which are then verified in parallel by the larger "verifier" model.

However, standard SpS enforces strict distributional equivalence with the verifier. A drafted token $n$ is accepted only with probability $\min\{1, q(n)/p(n)\}$, where $p$ and $q$ are the draft and verifier distributions, so a perfectly reasonable token can still be rejected simply because the verifier assigns it somewhat less probability than the drafter did. This strictness limits the acceptance rate.
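To make the strictness concrete, here is a minimal sketch of one verification step in standard speculative sampling, assuming the usual rule of accepting with probability $\min\{1, q(n)/p(n)\}$ and resampling from the normalized residual on rejection. The function name `sps_verify_token` and the toy distributions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sps_verify_token(p, q, n, rng):
    """One verification step of standard speculative sampling (illustrative).

    p, q: draft and verifier distributions over the vocabulary (1-D arrays).
    n:    index of the drafted token, assumed to have been sampled from p.
    Returns (accepted, token).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Accept the drafted token with probability min(1, q(n) / p(n)).
    accept_prob = min(1.0, q[n] / p[n])
    if rng.random() < accept_prob:
        return True, n
    # On rejection, resample from the normalized residual max(q - p, 0).
    # This correction is what keeps the output distributed exactly as q.
    residual = np.maximum(q - p, 0.0)
    residual /= residual.sum()
    return False, int(rng.choice(len(q), p=residual))

# Toy example: the drafter over-weights token 0 relative to the verifier.
rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.3, 0.3, 0.4])
drafted = int(rng.choice(len(p), p=p))
accepted, token = sps_verify_token(p, q, drafted, rng)
```

In this toy setup, token 0 is accepted only half the time (0.3 / 0.6) even though the verifier itself gives it 30% of the mass, which is the kind of rejection the relaxed methods try to reduce.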

Typical Acceptance Sampling (TAS) attempts to relax this by using entropy-based heuristics to accept more tokens. However, the paper argues that TAS distorts the verifier's output distribution significantly. By collapsing the distribution into low-entropy, deterministic outputs, TAS risks semantic drift, degrading output quality, especially when the verifier encodes critical, high-entropy information (e.g., in complex reasoning tasks).

2. Methodology: Cactus

The authors propose Cactus (Constrained Acceptance Speculative Sampling), a method that formalizes speculative sampling as a constrained optimization problem.

Theoretical Formulation

Instead of strictly matching the verifier distribution $q$, Cactus seeks a target distribution $h$ that:

Maximizes the acceptance rate (specifically the probability of accepting the drafted token $n$).
Remains within a bounded distance (divergence) $\delta$ from the verifier distribution $q$.

This is formulated as:
$\max_{h} \min \left\{ \frac{h(n)}{p(n)}, 1 \right\} \quad \text{s.t.} \quad D_f(h \| q) \leq \delta$
where $p$ is the draft model's distribution, $D_f$ is an $f$-divergence (specifically the KL divergence in Cactus), and $\delta$ is a hyperparameter controlling the allowed deviation.
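The two pieces of this optimization problem can be evaluated directly for any candidate $h$. The sketch below assumes the KL divergence as $D_f$; the helper names (`acceptance_prob`, `kl_divergence`, `is_feasible`) and the example numbers are hypothetical and only illustrate the trade-off.

```python
import numpy as np

def acceptance_prob(h, p, n):
    """Objective: probability of accepting drafted token n, min{h(n)/p(n), 1}."""
    return min(h[n] / p[n], 1.0)

def kl_divergence(h, q):
    """Constraint term D_KL(h || q), treating 0 * log(0) as 0."""
    h = np.asarray(h, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = h > 0
    return float(np.sum(h[mask] * np.log(h[mask] / q[mask])))

def is_feasible(h, q, delta):
    """True if the candidate h stays within divergence delta of q."""
    return kl_divergence(h, q) <= delta

# Toy distributions (assumed values); the drafted token is n = 0.
q = np.array([0.2, 0.5, 0.3])   # verifier
p = np.array([0.5, 0.3, 0.2])   # drafter
n = 0

# Candidate 1: h = q, i.e. standard speculative sampling (zero divergence).
baseline_accept = acceptance_prob(q, p, n)

# Candidate 2: move mass onto token 0 and shrink the others proportionally.
h = np.array([0.3, 0.4375, 0.2625])
boosted_accept = acceptance_prob(h, p, n)
boosted_kl = kl_divergence(h, q)
```

Here the boosted candidate raises the acceptance probability from 0.4 to 0.6 at a KL cost of about 0.028, so it is feasible for $\delta = 0.05$ but not for $\delta = 0.01$.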

The Algorithm

The paper derives an optimal solution for $h$ in which the probability of the drafted token $n$ is raised from $q(n)$ to a boosted value $\gamma^*$ (the "bonus"), while the probabilities of all other tokens are scaled down proportionally to maintain a valid probability distribution.

Exact Solution: The optimal $\gamma^*$ is the root of a transcendental equation involving the $f$-divergence function.
Approximation (Cactus): Since finding the exact root is computationally expensive, Cactus uses a second-order Taylor series approximation of the KL-divergence function around the verifier's probability $q(n)$. This yields a closed-form solution:
$\gamma^* \approx \min \left( q(n) + \sqrt{2\delta q(n)(1 - q(n))}, 1 \right)$
The final distribution $h$ is constructed by setting $h(n) = \gamma^*$ and rescaling the remaining entries of $q$ so that they sum to $1 - \gamma^*$, i.e. $h(x) = q(x)\,\frac{1 - \gamma^*}{1 - q(n)}$ for $x \neq n$.
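The closed form can be recovered in a few lines. What follows is a reconstruction from the description above, not the paper's own derivation: it assumes $h(n) = \gamma$ with every other entry of $q$ scaled by the same factor, so the KL divergence collapses to a two-outcome expression $g(\gamma)$.

```latex
\begin{align*}
g(\gamma) &= D_{\mathrm{KL}}(h \,\|\, q)
  = \gamma \log\frac{\gamma}{q(n)} + (1-\gamma)\log\frac{1-\gamma}{1-q(n)}, \\
g'(\gamma) &= \log\frac{\gamma}{q(n)} - \log\frac{1-\gamma}{1-q(n)},
  \qquad g''(\gamma) = \frac{1}{\gamma(1-\gamma)}, \\
g\bigl(q(n)\bigr) &= 0, \qquad g'\bigl(q(n)\bigr) = 0
  \;\;\Longrightarrow\;\;
  g(\gamma) \approx \frac{\bigl(\gamma - q(n)\bigr)^2}{2\,q(n)\bigl(1-q(n)\bigr)}, \\
g(\gamma) = \delta
  &\;\;\Longrightarrow\;\;
  \gamma^* \approx q(n) + \sqrt{2\,\delta\,q(n)\bigl(1-q(n)\bigr)} .
\end{align*}
```

The larger root is taken because the goal is to increase $h(n)$, and the result is clipped at 1 so that $h$ remains a probability distribution. Because the quadratic is only an approximation of $g$, the realized KL divergence can differ slightly from $\delta$.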

Key Advantages over TAS

Controlled Divergence: Cactus explicitly guarantees that the divergence between the generated distribution and the verifier is bounded by $\delta$.
Entropy Preservation: Unlike TAS, which minimizes cross-entropy (often leading to zero-entropy, deterministic outputs), Cactus constrains the KL-divergence, preserving the shape and entropy of the verifier's distribution.
Efficiency: Cactus only requires reading the probability of the single drafted token $n$, avoiding the need to access the full vocabulary, which reduces memory overhead.
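A minimal sketch of the closed-form rule, under the assumption that the acceptance test is $\min\{h(n)/p(n), 1\}$ with $h(n) = \gamma^*$. The function names are hypothetical, the handling of a rejected token is not shown, and `cactus_target` builds the full $h$ only so the divergence can be inspected; the acceptance test itself touches just $q(n)$ and $p(n)$.

```python
import math
import numpy as np

def cactus_gamma(q_n, delta):
    """Closed-form approximation of the boosted probability gamma*."""
    return min(q_n + math.sqrt(2.0 * delta * q_n * (1.0 - q_n)), 1.0)

def cactus_accept_prob(q_n, p_n, delta):
    """Acceptance probability min{gamma*/p(n), 1}; needs only q(n) and p(n)."""
    return min(cactus_gamma(q_n, delta) / p_n, 1.0)

def cactus_target(q, n, delta):
    """Full target h: h(n) = gamma*, other entries of q rescaled proportionally."""
    q = np.asarray(q, dtype=float)
    if q[n] >= 1.0:
        return q.copy()  # all mass is already on the drafted token
    gamma = cactus_gamma(q[n], delta)
    h = q * (1.0 - gamma) / (1.0 - q[n])
    h[n] = gamma
    return h

# Toy example (assumed values): drafted token n = 0, delta = 0.05.
q = np.array([0.1, 0.2, 0.3, 0.4])   # verifier
p = np.array([0.4, 0.3, 0.2, 0.1])   # drafter
n, delta = 0, 0.05

sps_accept = min(q[n] / p[n], 1.0)
cactus_accept = cactus_accept_prob(q[n], p[n], delta)
h = cactus_target(q, n, delta)
kl_h_q = float(np.sum(h * np.log(h / q)))
```

For these numbers the acceptance probability rises from 0.25 to about 0.49, and the realized KL divergence is about 0.040, under the budget of 0.05. Setting `delta = 0` gives $\gamma^* = q(n)$ and recovers the standard acceptance rule.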

3. Key Contributions

Theoretical Framework: The first formalization of speculative sampling as a constrained optimization problem, explicitly trading off acceptance rate against distributional divergence.
Novel Algorithm: The proposal of Cactus, a training-free, lightweight modification that provably approximates the verifier distribution while achieving higher acceptance rates.
Theoretical Analysis of TAS: The paper demonstrates that TAS implicitly solves a variant of the optimization problem using cross-entropy, which leads to suboptimal, low-entropy distributions that fail to capture the full shape of the verifier's output.
Empirical Validation: Extensive experiments showing Cactus outperforms both standard SpS and TAS across diverse benchmarks and model sizes.

4. Experimental Results

The authors evaluated Cactus on GSM8K (math), IFEval (instruction following), and GPQA (science) using various Qwen 3 model pairs (e.g., 0.6B drafter + 8B/14B/32B verifier).

Throughput (Acceptance Rate): Cactus consistently achieved the highest Average Accepted Length (AL). For example, on GSM8K with $m=20$, Cactus achieved an AL of 7.61 (vs. 5.44 for SpS and 7.23 for TAS), roughly a 40% increase in accepted length over SpS.
Accuracy:
- SpS: Maintains verifier accuracy (lossless).
- TAS: Often degrades accuracy on complex tasks (e.g., dropped from 42.93 to 38.89 on GPQA) due to distributional distortion.
- Cactus: Maintains or improves accuracy compared to the verifier. On GPQA, Cactus (with $\delta=0.75$) achieved 45.46, outperforming both SpS (42.93) and TAS (38.89).
Wall-Time Speedup: On A100 GPUs, Cactus achieved up to 1.9x speedup over the verifier alone while maintaining the highest task scores.
Generalization: The method was tested on other model families (Gemma, DeepSeek R1, LLaMA) and larger model pairs (1.7B + 32B), consistently showing robust performance and superior trade-offs compared to baselines.
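As a quick sanity check on the figures quoted above, the relative gains can be computed directly. This short sketch uses only the numbers reported in this section; the variable names are illustrative.

```python
# Reported figures quoted above (GSM8K accepted length, GPQA accuracy).
accepted_length = {"SpS": 5.44, "TAS": 7.23, "Cactus": 7.61}
gpqa_accuracy = {"SpS": 42.93, "TAS": 38.89, "Cactus": 45.46}

def relative_gain(new, base):
    """Relative change of `new` over `base`, as a fraction."""
    return (new - base) / base

al_gain_vs_sps = relative_gain(accepted_length["Cactus"], accepted_length["SpS"])
al_gain_vs_tas = relative_gain(accepted_length["Cactus"], accepted_length["TAS"])
gpqa_delta_vs_sps = gpqa_accuracy["Cactus"] - gpqa_accuracy["SpS"]
gpqa_delta_vs_tas = gpqa_accuracy["Cactus"] - gpqa_accuracy["TAS"]

print(f"AL gain vs SpS: {al_gain_vs_sps:.1%}, vs TAS: {al_gain_vs_tas:.1%}")
print(f"GPQA points vs SpS: {gpqa_delta_vs_sps:+.2f}, vs TAS: {gpqa_delta_vs_tas:+.2f}")
```

On these numbers Cactus accepts about 40% more tokens per step than SpS and about 5% more than TAS, while scoring 2.53 points above SpS and 6.57 points above TAS on GPQA.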

5. Significance

Practical Efficiency: Cactus provides a theoretically grounded, training-free method to accelerate LLM inference without sacrificing, and in some reported cases while improving, generation quality.
Quality-Efficiency Balance: It addresses the "quality vs. speed" dilemma often seen in lossy sampling methods. By explicitly bounding divergence from the verifier, it limits the semantic drift that affects other heuristic-based acceleration methods.
Scalability: The method is lightweight (element-wise operations) and scales effectively to larger models and diverse architectures, making it suitable for real-time deployment in resource-constrained environments.

In conclusion, Cactus advances speculative decoding by shifting the paradigm from "strict matching" or "heuristic relaxation" to constrained optimization, so that acceleration comes with an explicit bound on how far the output may deviate from the verifier.

