EAGLE-Pangu: Accelerator-Safe Tree Speculative Decoding on Ascend NPUs

EAGLE-Pangu is a reproducible system that adapts EAGLE-3-style tree speculative decoding to Ascend NPUs with Pangu teacher models. It introduces an explicit cache manager, accelerator-safe tensorization, and a fused-kernel verification path, achieving up to a 2.46x throughput improvement over greedy decoding.

Chang Han, Yijie Hu, Jingling Liu

Published Tue, 10 Ma

Imagine you are trying to write a long story with a very famous, brilliant, but incredibly slow author (the Teacher Model). Every time you want to add a new word to the story, you have to wait for this author to think, write, and approve it. If you need 1,000 words, you have to wait for them to think 1,000 separate times. This is the bottleneck in making AI chatbots fast.

To speed this up, you hire a fast, energetic intern (the Draft Model) to guess the next few words. Usually, you just let the intern guess one word, show it to the author, and if the author agrees, you keep it. If not, you start over.

EAGLE-Pangu is a new system that changes the game. Instead of guessing just one word, the intern now builds a tree of possibilities. Imagine the intern doesn't just say, "I think the next word is 'cat'." Instead, they say:

  • "Maybe it's 'cat'..."
  • "Or maybe it's 'dog'..."
  • "Or maybe it's 'bird'..."

And then, for each of those, they guess another word. They create a branching tree of "what if" scenarios all at once. Then, they show this whole tree to the slow author. The author looks at the tree and says, "Okay, 'cat' is good, but 'dog' is wrong. And for 'cat', the next word 'sat' is perfect, but 'ran' is wrong."

Suddenly, in one single check, the author has approved a whole path of words. This is Tree Speculative Decoding. It's like the author reading a whole paragraph of your draft in one glance instead of checking word-by-word.
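The tree-verification idea above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the token strings, the parent-pointer tree layout, and the `teacher_accepts` callback are all made up for the example, and a real system would score the whole tree in a single batched forward pass rather than one path at a time.

```python
# Toy sketch of tree speculative decoding verification (illustrative only).
# The draft tree is stored as parent pointers: parent[i] is the index of
# node i's parent, or -1 if node i hangs directly off the current text.

def ancestors(parent, i):
    """Return the root-to-node path of tree indices for node i (inclusive)."""
    path = []
    while i != -1:
        path.append(i)
        i = parent[i]
    return path[::-1]

def verify_tree(tokens, parent, teacher_accepts):
    """Find the longest draft path the teacher fully approves.

    tokens[i]       -- the draft token at tree node i
    parent[i]       -- parent index of node i (-1 for roots)
    teacher_accepts -- callable(path_tokens) -> bool; stands in for the
                       teacher checking one root-to-node path
    """
    best = []
    for i in range(len(tokens)):
        path = [tokens[j] for j in ancestors(parent, i)]
        if teacher_accepts(path) and len(path) > len(best):
            best = path
    return best

# The "cat vs. dog" tree from the story: two first guesses, each followed
# by a second guess hanging under "cat".
tokens = ["cat", "dog", "sat", "ran"]
parent = [-1, -1, 0, 0]
approved = {("cat",), ("cat", "sat")}  # what this toy teacher would accept
accepted = verify_tree(tokens, parent, lambda p: tuple(p) in approved)
print(accepted)  # ['cat', 'sat']
```

One check over the whole tree yields the two-word path "cat sat", exactly the "approve a whole path at once" effect described above.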

The Problem: The "Hardware" Mismatch

The paper explains that while this idea sounds great, it's a nightmare to build on certain types of computer chips (specifically Ascend NPUs, which are powerful AI chips made by Huawei).

Think of the AI chip like a very strict, high-security factory.

  1. The "Negative Index" Trap: In regular programming, you can say "go back 1 step" (index -1). But on these specific chips, saying "go back" is like trying to walk off the edge of a cliff; the factory shuts down or crashes. The authors had to build a safety net so the system never tries to walk off the edge.
  2. The "Leaky" Memory: When the intern is guessing different branches (Cat vs. Dog), they need to keep their notes separate. If the "Cat" branch accidentally reads the "Dog" branch's notes, the whole story gets messed up. The authors built a special Branch/Commit Manager (like a librarian with separate, locked folders) to ensure the "Cat" branch never sees the "Dog" branch's secrets until the author makes a final decision.
  3. The "Masking" Issue: The author needs to know which words can "talk" to which other words. In a tree, the word "sat" can look back at "cat," but it shouldn't look at "dog." The authors created a special Traffic Light System (a mask) that tells the computer exactly which words are allowed to look at each other, preventing information leaks.
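The "traffic light" mask in point 3 falls straight out of the tree's parent pointers: node i may attend to node j only if j lies on i's own root-to-node path (an ancestor, or i itself). Here is a minimal sketch under that assumption; the tree layout is illustrative and the real system builds this as a tensor on the NPU.

```python
# Sketch of a tree attention mask: mask[i][j] is True exactly when node j
# is node i or one of i's ancestors, so "sat" can see "cat" but not "dog".

def tree_mask(parent):
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:       # walk up toward the root, marking each ancestor
            mask[i][j] = True
            j = parent[j]
    return mask

# Nodes: 0="cat", 1="dog", 2="sat" (child of "cat"), 3="ran" (child of "cat")
m = tree_mask([-1, -1, 0, 0])
print(m[2])  # [True, False, True, False] -- "sat" sees "cat" and itself only
```

Because the mask blocks every cross-branch pair, the "cat" branch can never read the "dog" branch's state, which is the information leak the authors guard against.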

The Solution: EAGLE-Pangu

The team built a system called EAGLE-Pangu that acts as a translator and safety inspector between the clever "Tree" idea and the strict "Ascend" factory.

  • The Safety Net: They replaced all the dangerous "walk off the cliff" instructions with safe, dummy instructions that the factory understands.
  • The Locked Folders: They created a way to copy the author's memory notes instantly for each branch without slowing everything down, ensuring no secrets leak between branches.
  • The Fast Lane: They optimized the system to use the factory's fastest assembly lines (fused kernels) so the author can check the whole tree quickly.
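The "Safety Net" bullet can be made concrete with a common pattern for this problem: before any gather runs on the accelerator, every negative (out-of-range) index is redirected to a harmless dummy slot, and a validity flag tells the caller which results to discard. This is a generic sketch of that pattern, not the paper's actual code; `DUMMY` and `safe_gather` are names invented for the example.

```python
# Sketch of an accelerator-safe gather: on hardware where a negative index
# can crash the kernel, remap it to a valid dummy slot and mask the result.

DUMMY = 0  # any in-range slot works; its output is thrown away via `valid`

def safe_gather(table, indices):
    """Gather table rows, never passing a negative index through.

    Returns (rows, valid): rows[i] is table[indices[i]] when indices[i] is
    in range, otherwise the dummy row; valid[i] marks which rows are real.
    """
    safe_idx = [i if i >= 0 else DUMMY for i in indices]
    valid = [i >= 0 for i in indices]
    return [table[i] for i in safe_idx], valid

rows, valid = safe_gather(["a", "b", "c"], [2, -1, 0])
print(rows, valid)  # ['c', 'a', 'a'] [True, False, True]
```

The key property is that the accelerator only ever sees in-range indices, so the "walk off the cliff" case can no longer occur; correctness is recovered on the host side by ignoring the flagged dummy results.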

The Results

When they tested this system:

  • Speed: It made the AI 1.27 times faster on average. For some difficult, long conversations, it was up to 2.46 times faster.
  • Reliability: It didn't crash. It handled the strict rules of the Ascend chips perfectly.
  • The "Sweet Spot": They found that making the tree too big (guessing too many branches) actually slows things down because the author gets overwhelmed checking them all. There is a perfect size for the tree, and they found it.
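The "sweet spot" has a simple intuition that a toy throughput model can show: each extra branch adds less and less chance of being accepted, while the cost of verifying the tree keeps growing with its size. The functional forms and every constant below are invented for illustration and are not the paper's measurements.

```python
# Toy model of the tree-size sweet spot (made-up numbers, illustration only):
# accepted tokens grow logarithmically with tree size (diminishing returns),
# while verification cost grows roughly linearly with the number of nodes.

import math

def tokens_per_check(tree_size, accept_scale=3.0,
                     verify_base=1.0, verify_per_node=0.05):
    accepted = accept_scale * math.log1p(tree_size)   # diminishing returns
    cost = verify_base + verify_per_node * tree_size  # checking gets pricier
    return accepted / cost

best = max(range(1, 200), key=tokens_per_check)
print(best)  # the model peaks at a moderate size, not the largest tree
```

Even in this crude model, throughput rises, peaks at a moderate tree size, and then falls as the author drowns in branches to check, which is qualitatively the behavior the authors report.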

The Takeaway

Think of EAGLE-Pangu as a master architect who figured out how to build a complex, multi-lane highway (the tree of guesses) on top of a very strict, old-fashioned bridge (the Ascend chip). Before, trying to drive a sports car on that bridge would cause a crash. Now, thanks to their new safety rails and traffic lights, the car can zoom across, delivering the AI's answers much faster to the user.

They didn't invent a new way to write stories; they just invented a much safer and faster way to deliver the story on specific hardware.