Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

The paper proposes Slow-Fast Inference (SFI), a training-free framework that accelerates long-context autoregressive decoding. SFI dynamically alternates between low-cost fast steps, which decode against a stable sparse memory, and occasional slow steps that refresh the full context at semantic boundaries, achieving significant throughput gains without compromising generation quality.

Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan

Published 2026-03-13

Imagine you are reading a very long, complex novel. As you read, you need to remember the characters, the plot twists, and the setting to understand what's happening next.

The Problem: The "Heavy Backpack" of Memory
Currently, when AI models (like the ones powering chatbots) read a long story, they carry a "backpack" of every single word they've ever seen in that story. Every time they guess the next word, they have to dig through this entire, growing backpack to find the most relevant clues.

  • The Analogy: Imagine trying to write a sentence while carrying a backpack that gets heavier with every word you write. To write the next word, you have to stop, unzip the whole backpack, search through thousands of pages of notes, and then write. As the story gets longer, this process becomes incredibly slow and exhausting.

The Observation: "The Plot Doesn't Change Every Second"
The researchers behind this paper noticed something interesting about how humans (and AI) read. When you are in the middle of a single sentence or a short paragraph, the things you need to remember don't change every single word.

  • The Analogy: If you are reading a paragraph about a "cat sitting on a mat," the fact that there is a "cat" and a "mat" is relevant for the whole paragraph. You don't need to re-read the first sentence of the book to know the cat is still there. The "important stuff" stays stable for a while.

The Solution: Slow-Fast Inference (SFI)
The paper proposes a new way to read called Slow-Fast Inference. It's like hiring a smart assistant who knows when to work hard and when to coast.

1. The "Fast Steps" (Coasting)

Most of the time, the AI doesn't need to dig through the whole backpack.

  • How it works: The AI creates a tiny, compact "cheat sheet" containing only the most important things it needs right now (like the current character names and the immediate context).
  • The Analogy: Instead of opening the whole backpack, the assistant pulls out a small index card with the key facts. They write the next few words of the story using just this card. This is super fast and requires very little energy.
  • When it happens: This happens for most of the words in a sentence.
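In code terms, a fast step is just attention restricted to the small retained subset of the key-value cache. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function name and the `cheat_sheet_idx` parameter are our own illustrative choices.

```python
import numpy as np

def fast_step(query, keys, values, cheat_sheet_idx):
    """One 'fast' decode step: attend only to the small retained subset
    of the KV cache (the 'cheat sheet'), not the full history.
    Illustrative sketch; names are not the paper's API."""
    k = keys[cheat_sheet_idx]            # (m, d) with m << total length
    v = values[cheat_sheet_idx]          # (m, d)
    scores = k @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ v                   # attention output from sparse memory
```

Because `m` stays small and fixed between refreshes, the per-token cost no longer grows with the full context length.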

2. The "Slow Steps" (The Deep Dive)

Every now and then, the story hits a major turning point (like a new paragraph, a new scene, or a sentence ending).

  • How it works: The AI stops coasting. It opens the full backpack, reads the whole history again, and figures out what the new most important things are. It then updates its "cheat sheet" with this fresh information.
  • The Analogy: The assistant realizes, "Wait, the cat just jumped off the mat and is now chasing a dog!" The old cheat sheet is outdated. So, they quickly scan the whole story again, update the card with "Dog" and "Chasing," and close the backpack.
  • The Trigger: This happens automatically at natural breaks in the text (like periods or new paragraphs) or if the AI has gone too long without checking.
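Putting the two modes together, the decoding loop can be sketched as below. The boundary tokens, the refresh cap, and the `slow_step`/`fast_step` helpers are all assumptions made for illustration; the paper's actual triggers and interfaces may differ.

```python
BOUNDARY_TOKENS = {".", "!", "?", "\n"}  # assumed sentence/paragraph breaks
MAX_FAST_STEPS = 32                      # assumed cap before a forced refresh

def decode(model, n_tokens):
    """Slow-fast decoding loop (illustrative sketch).

    `model.slow_step` attends to the full KV cache and returns a refreshed
    sparse index set; `model.fast_step` attends only to that set. Both are
    hypothetical helpers standing in for the paper's machinery."""
    cheat_sheet = None
    steps_since_refresh = 0
    out = []
    for _ in range(n_tokens):
        if cheat_sheet is None or steps_since_refresh >= MAX_FAST_STEPS:
            token, cheat_sheet = model.slow_step()   # full-context refresh
            steps_since_refresh = 0
        else:
            token = model.fast_step(cheat_sheet)     # cheap sparse step
            steps_since_refresh += 1
        out.append(token)
        if token in BOUNDARY_TOKENS:                 # semantic boundary:
            cheat_sheet = None                       # force a slow step next
    return out
```

Note that the slow step is amortized: it runs once per sentence-like unit, while every token in between pays only the fast-step cost.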

3. The "Selector" (The Smart Librarian)

The hardest part is deciding what to put on the cheat sheet. If you pick the wrong things, the story makes no sense.

  • How it works: The paper introduces a special tool called a Selector. When the AI does a "Slow Step" and reads the whole story, the Selector acts like a super-smart librarian. It looks at the whole story and says, "Okay, for the next few sentences, we definitely need to remember the 'Cat' and the 'Dog,' but we can forget the 'Red Hat' from three pages ago."
  • The Magic: It uses a clever mathematical trick to mix the fresh reading with some general rules about how stories usually work, ensuring the cheat sheet is perfect for the next batch of "Fast Steps."
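The article describes the Selector only at this high level, so the sketch below is purely a stand-in of our own: a common proxy for "what matters next" is the attention mass each cached position received during the slow step, blended with a fixed recency prior (the "general rules about how stories usually work" from the analogy). The parameter names and the mixing scheme are assumptions, not the paper's method.

```python
import numpy as np

def select_cheat_sheet(attn_weights, keep=64, recency=16, prior_mix=0.3):
    """Pick which KV-cache positions to keep for the next run of fast steps.

    NOT the paper's exact Selector: a simple stand-in that mixes observed
    attention mass with a recency prior."""
    n = attn_weights.shape[-1]
    score = attn_weights.mean(axis=0)          # avg attention per position
    prior = np.zeros(n)
    prior[-recency:] = 1.0 / recency           # always favor recent tokens
    mixed = (1 - prior_mix) * score + prior_mix * prior
    keep = min(keep, n)
    return np.sort(np.argsort(mixed)[-keep:])  # indices of retained entries
```

The design choice here is the blend: relying only on observed attention can overfit to the last few tokens' queries, while the prior keeps a guaranteed window of recent context on the cheat sheet.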

Why This Matters

  • Speed: Because the AI spends 90% of its time using the tiny cheat sheet (Fast Steps) instead of the giant backpack, it can read and write 1.6 to 14 times faster.
  • Quality: Even though it's skipping the heavy lifting most of the time, it checks in often enough (Slow Steps) that it doesn't lose the plot. The quality of the story remains just as good as if it had read everything every time.
  • No Training Needed: The best part? You don't need to teach the AI a new way of thinking. You can just give this "Slow-Fast" rule to any existing AI model, and it works immediately.

In Summary:
Think of Slow-Fast Inference as a runner who usually jogs lightly (Fast Steps) but stops briefly at every mile marker to check the map and adjust their route (Slow Steps). This is much faster than stopping to check the map after every single step, but it ensures they never get lost. This allows AI to handle massive stories and complex reasoning tasks without slowing down to a crawl.