Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification

Quasar is a training-free framework that accelerates Large Language Model inference by applying low-bit quantization to the verification stage of speculative decoding. By easing the memory-bandwidth bottleneck of verification, it delivers a 1.28× throughput improvement while maintaining high acceptance rates.

Guang Huang, Zeyi Wen

Published 2026-03-03

Imagine you are a master chef (the AI Model) trying to write a complex recipe for a very fancy dish. The problem is, the chef is incredibly slow. They can only write one word at a time, and before they write the next word, they have to walk all the way to the pantry, grab the exact spice jar they need, and bring it back. This "walk to the pantry" is the memory bottleneck. The chef spends 90% of their time walking and only 10% actually cooking.

The Old Way: Speculative Decoding

To speed things up, someone suggested a new strategy called Speculative Decoding.

Imagine the chef hires a fast, energetic sous-chef (the Draft Model).

  1. The Draft: The sous-chef quickly guesses the next 5 words of the recipe and writes them down on a sticky note.
  2. The Verification: The master chef looks at the sticky note. Instead of writing the words one by one, they check all 5 words at once to see if the sous-chef was right.
    • If the chef agrees, they accept all 5 words instantly.
    • If the chef disagrees with the 3rd word, they throw away the rest and start over from there.
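The draft-then-verify loop above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `draft_next` and `target_next` stand in for the small draft model and the large target model, and here they are simple stand-in rules so the example runs on its own.

```python
def draft_next(token):
    # Toy stand-in for the fast draft model (the "sous-chef").
    return token + 1

def target_next(token):
    # Toy stand-in for the slow target model (the "master chef"):
    # usually agrees with the draft, but not always.
    return token + 1 if token % 4 != 0 else token + 2

def speculative_step(token, k=5):
    """Draft k tokens cheaply, then verify them in one pass of the target."""
    # The draft: guess k tokens in a row (the sticky note).
    drafts = []
    t = token
    for _ in range(k):
        t = draft_next(t)
        drafts.append(t)

    # The verification: walk the note; keep matches, stop at the first miss.
    accepted = []
    t = token
    for guess in drafts:
        correct = target_next(t)
        if guess == correct:
            accepted.append(guess)    # chef agrees: keep the word
            t = guess
        else:
            accepted.append(correct)  # first disagreement: take the chef's
            break                     # word and discard the rest of the note
    return accepted

print(speculative_step(1, k=5))  # accepts several drafts, then corrects one
```

The key point the analogy captures: one pass of the expensive target model can validate several draft tokens at once, so the output matches what the target model alone would have produced.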

The Problem:
Even though the sous-chef is fast, the master chef is still the bottleneck. Every time they check the sticky note, they still have to walk to the pantry to get the heavy, full-sized spice jars (the Full-Precision Weights) to verify the words. As the sous-chef gets faster and writes longer notes, the chef spends more time walking to the pantry than actually cooking. The system hits a "memory wall."
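A back-of-envelope calculation shows why the walk to the pantry dominates. The numbers below are illustrative assumptions, not figures from the paper: a 7B-parameter model and roughly 1 TB/s of GPU memory bandwidth.

```python
# Back-of-envelope for the "memory wall": each decode (or verification)
# step must stream essentially every weight from GPU memory.
params = 7e9        # assumed 7B-parameter model
bytes_fp16 = 2      # bytes per weight at full (16-bit) precision
bytes_int4 = 0.5    # bytes per weight at 4-bit quantization
bandwidth = 1e12    # assumed ~1 TB/s of memory bandwidth

t_full = params * bytes_fp16 / bandwidth   # seconds per step, fp16 weights
t_quant = params * bytes_int4 / bandwidth  # seconds per step, 4-bit weights

print(f"fp16 weights: {t_full * 1000:.1f} ms per step")
print(f"int4 weights: {t_quant * 1000:.1f} ms per step")
```

Under these assumptions, just streaming fp16 weights costs ~14 ms per step regardless of how little arithmetic is done, which is why shrinking the verifier's weights attacks the actual bottleneck.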

The New Solution: Quasar

Enter Quasar (Quantized Self-Speculative Acceleration for Rapid Inference).

Quasar asks a simple question: "Does the chef really need the giant, heavy, full-sized spice jars just to check if the words look right? Or can they use a smaller, lighter version?"

The Analogy of "Lightweight Verification":
Imagine the master chef keeps a miniature, lightweight spice rack right on the counter. These spices are slightly less detailed (they are Quantized or compressed), but they are right there.

  1. The Draft: The sous-chef still writes the 5 words quickly.
  2. The New Verification: Instead of walking to the heavy pantry, the chef grabs the lightweight spice rack from the counter.
    • Because the rack is light and close by, the chef can check the words twice as fast.
    • The chef doesn't need the perfect detail of the giant jar to know if "salt" is the right word; the lightweight version is accurate enough to make the decision.
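The "miniature spice rack" is weight quantization. A minimal sketch of the idea, using per-tensor symmetric int8 quantization in plain Python (real systems use finer-grained scales and fused GPU kernels, and the paper's exact scheme may differ):

```python
def quantize(weights):
    """Map float weights to int8 values plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # each value now fits in 8 bits
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the compact int8 copy."""
    return [x * scale for x in q]

w = [0.12, -0.97, 0.45, 0.03]   # a few "full-sized spice jars"
q, s = quantize(w)              # the lightweight counter-top rack
w_approx = dequantize(q, s)

# The int8 copy is 4x smaller than float32, so it streams from memory
# much faster, and the reconstruction error is tiny.
err = max(abs(a - b) for a, b in zip(w, w_approx))
print(q)
print(round(err, 4))
```

The verification step only needs to decide whether a drafted token matches, so this small approximation error rarely flips the accept/reject decision: that is the intuition behind "accurate enough to make the call."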

Why This is a Big Deal

The paper reports two surprising findings:

  1. It's Fast: By using the lightweight rack, the chef stops wasting time walking to the pantry. The whole kitchen runs 28% faster (1.28x speedup) without changing the recipe.
  2. It's Accurate: You might think, "If I use a cheap spice rack, the food will taste bad." But the researchers found that for the specific job of checking the words, the lightweight rack is almost perfect. The final dish tastes exactly the same as if the chef used the heavy jars.

Why Not Just Cut the Chef's Arms? (The Pruning Comparison)

The paper also looked at another idea: Pruning. This is like trying to speed up the chef by cutting off their arms or legs (removing parts of the AI model) to make them lighter.

  • The Result: It didn't work. If you cut off too much, the chef can't think clearly and guesses wrong constantly. If you cut off too little, they are still too heavy to move fast.
  • Quasar's Win: Quasar keeps the chef's full brain (the whole model) but just makes the tools they use for checking lighter. It's the perfect balance.

The Bottom Line

Quasar is like giving a marathon runner a pair of super-lightweight shoes. They don't change the runner's muscles or training (the AI's intelligence); they just remove the unnecessary weight that was slowing them down.

  • Before: The AI was stuck in traffic, waiting for heavy data to load.
  • Now: The AI drives a sports car with a lighter engine, zooming through the same route much faster, arriving at the same destination with the same quality.

This allows AI to answer your questions, write code, and solve math problems significantly faster, making powerful AI more accessible and responsive for everyone.
