Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification

Quasar is a training-free framework that accelerates Large Language Model inference by applying low-bit quantization to the verification stage of speculative decoding. By easing the memory-bandwidth bottleneck of verification, it delivers a 1.28× throughput improvement while maintaining high acceptance rates.

Guang Huang, Zeyi Wen

Published 2026-03-03

Imagine you are a master chef (the AI Model) trying to write a complex recipe for a very fancy dish. The problem is, the chef is incredibly slow. They can only write one word at a time, and before they write the next word, they have to walk all the way to the pantry, grab the exact spice jar they need, and bring it back. This "walk to the pantry" is the memory bottleneck. The chef spends 90% of their time walking and only 10% actually cooking.

The Old Way: Speculative Decoding

To speed things up, someone suggested a new strategy called Speculative Decoding.

Imagine the chef hires a fast, energetic sous-chef (the Draft Model).

  1. The Draft: The sous-chef quickly guesses the next 5 words of the recipe and writes them down on a sticky note.
  2. The Verification: The master chef looks at the sticky note. Instead of writing the words one by one, they check all 5 words at once to see if the sous-chef was right.
    • If the chef agrees, they accept all 5 words instantly.
    • If the chef disagrees with the 3rd word, they throw away the rest and start over from there.
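The draft-then-verify loop above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `draft_next` and `target_next` stand in for the small draft model and the large target model, and here they are simple stand-in rules so the example runs on its own.

```python
def draft_next(token):
    # Toy stand-in for the fast draft model (the "sous-chef").
    return token + 1

def target_next(token):
    # Toy stand-in for the slow target model (the "master chef"):
    # usually agrees with the draft, but not always.
    return token + 1 if token % 4 != 0 else token + 2

def speculative_step(token, k=5):
    """Draft k tokens cheaply, then verify them in one pass of the target."""
    # The draft: guess k tokens in a row (the sticky note).
    drafts = []
    t = token
    for _ in range(k):
        t = draft_next(t)
        drafts.append(t)

    # The verification: walk the note; keep matches, stop at the first miss.
    accepted = []
    t = token
    for guess in drafts:
        correct = target_next(t)
        if guess == correct:
            accepted.append(guess)    # chef agrees: keep the word
            t = guess
        else:
            accepted.append(correct)  # first disagreement: take the chef's
            break                     # word and discard the rest of the note
    return accepted

print(speculative_step(1, k=5))  # accepts several drafts, then corrects one
```

The key point the analogy captures: one pass of the expensive target model can validate several draft tokens at once, so the output matches what the target model alone would have produced.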

The Problem:
Even though the sous-chef is fast, the master chef is still the bottleneck. Every time they check the sticky note, they still have to walk to the pantry to get the heavy, full-sized spice jars (the Full-Precision Weights) to verify the words. As the sous-chef gets faster and writes longer notes, the chef spends more time walking to the pantry than actually cooking. The system hits a "memory wall."
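A back-of-envelope calculation shows why the walk to the pantry dominates. The numbers below are illustrative assumptions, not figures from the paper: a 7B-parameter model and roughly 1 TB/s of GPU memory bandwidth.

```python
# Back-of-envelope for the "memory wall": each decode (or verification)
# step must stream essentially every weight from GPU memory.
params = 7e9        # assumed 7B-parameter model
bytes_fp16 = 2      # bytes per weight at full (16-bit) precision
bytes_int4 = 0.5    # bytes per weight at 4-bit quantization
bandwidth = 1e12    # assumed ~1 TB/s of memory bandwidth

t_full = params * bytes_fp16 / bandwidth   # seconds per step, fp16 weights
t_quant = params * bytes_int4 / bandwidth  # seconds per step, 4-bit weights

print(f"fp16 weights: {t_full * 1000:.1f} ms per step")
print(f"int4 weights: {t_quant * 1000:.1f} ms per step")
```

Under these assumptions, just streaming fp16 weights costs ~14 ms per step regardless of how little arithmetic is done, which is why shrinking the verifier's weights attacks the actual bottleneck.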

The New Solution: Quasar

Enter Quasar (Quantized Self-Speculative Acceleration for Rapid Inference).

Quasar asks a simple question: "Does the chef really need the giant, heavy, full-sized spice jars just to check if the words look right? Or can they use a smaller, lighter version?"

The Analogy of "Lightweight Verification":
Imagine the master chef keeps a miniature, lightweight spice rack right on the counter. These spices are slightly less detailed (they are Quantized or compressed), but they are right there.

  1. The Draft: The sous-chef still writes the 5 words quickly.
  2. The New Verification: Instead of walking to the heavy pantry, the chef grabs the lightweight spice rack from the counter.
    • Because the rack is light and close by, the chef can check the words twice as fast.
    • The chef doesn't need the perfect detail of the giant jar to know if "salt" is the right word; the lightweight version is accurate enough to make the decision.
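The "miniature spice rack" is weight quantization. A minimal sketch of the idea, using per-tensor symmetric int8 quantization in plain Python (real systems use finer-grained scales and fused GPU kernels, and the paper's exact scheme may differ):

```python
def quantize(weights):
    """Map float weights to int8 values plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # each value now fits in 8 bits
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the compact int8 copy."""
    return [x * scale for x in q]

w = [0.12, -0.97, 0.45, 0.03]   # a few "full-sized spice jars"
q, s = quantize(w)              # the lightweight counter-top rack
w_approx = dequantize(q, s)

# The int8 copy is 4x smaller than float32, so it streams from memory
# much faster, and the reconstruction error is tiny.
err = max(abs(a - b) for a, b in zip(w, w_approx))
print(q)
print(round(err, 4))
```

The verification step only needs to decide whether a drafted token matches, so this small approximation error rarely flips the accept/reject decision: that is the intuition behind "accurate enough to make the call."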

Why This is a Big Deal

The paper reports two surprising findings:

  1. It's Fast: By using the lightweight rack, the chef stops wasting time walking to the pantry. The whole kitchen runs 28% faster (1.28x speedup) without changing the recipe.
  2. It's Accurate: You might think, "If I use a cheap spice rack, the food will taste bad." But the researchers found that for the specific job of checking the words, the lightweight rack is almost perfect. The final dish tastes exactly the same as if the chef used the heavy jars.

Why Not Just Cut the Chef's Arms? (The Pruning Comparison)

The paper also looked at another idea: Pruning. This is like trying to speed up the chef by cutting off their arms or legs (removing parts of the AI model) to make them lighter.

  • The Result: It didn't work. If you cut off too much, the chef can't think clearly and guesses wrong constantly. If you cut off too little, they are still too heavy to move fast.
  • Quasar's Win: Quasar keeps the chef's full brain (the whole model) but just makes the tools they use for checking lighter. It's the perfect balance.

The Bottom Line

Quasar is like giving a marathon runner a pair of super-lightweight shoes. They don't change the runner's muscles or training (the AI's intelligence); they just remove the unnecessary weight that was slowing them down.

  • Before: The AI was stuck in traffic, waiting for heavy data to load.
  • Now: The AI drives a sports car with a lighter engine, zooming through the same route much faster, arriving at the same destination with the same quality.

This allows AI to answer your questions, write code, and solve math problems significantly faster, making powerful AI more accessible and responsive for everyone.
