Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

The paper proposes Slow-Fast Inference (SFI), a training-free framework that accelerates long-context autoregressive decoding. SFI dynamically alternates between low-cost fast steps, which decode against a stable sparse memory, and occasional slow steps that refresh the full context at semantic boundaries, achieving significant throughput gains without compromising generation quality.

Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan

Published 2026-03-13

Imagine you are reading a very long, complex novel. As you read, you need to remember the characters, the plot twists, and the setting to understand what's happening next.

The Problem: The "Heavy Backpack" of Memory
Currently, when AI models (like the ones powering chatbots) read a long story, they carry a "backpack" of every single word they've ever seen in that story. Every time they guess the next word, they have to dig through this entire, growing backpack to find the most relevant clues.

  • The Analogy: Imagine trying to write a sentence while carrying a backpack that gets heavier with every word you write. To write the next word, you have to stop, unzip the whole backpack, search through thousands of pages of notes, and then write. As the story gets longer, this process becomes incredibly slow and exhausting.

The Observation: "The Plot Doesn't Change Every Second"
The researchers behind this paper noticed something interesting about how humans (and AI) read. When you are in the middle of a single sentence or a short paragraph, the things you need to remember don't change every single word.

  • The Analogy: If you are reading a paragraph about a "cat sitting on a mat," the fact that there is a "cat" and a "mat" is relevant for the whole paragraph. You don't need to re-read the first sentence of the book to know the cat is still there. The "important stuff" stays stable for a while.

The Solution: Slow-Fast Inference (SFI)
The paper proposes a new way to read called Slow-Fast Inference. It's like hiring a smart assistant who knows when to work hard and when to coast.

1. The "Fast Steps" (Coasting)

Most of the time, the AI doesn't need to dig through the whole backpack.

  • How it works: The AI creates a tiny, compact "cheat sheet" containing only the most important things it needs right now (like the current character names and the immediate context).
  • The Analogy: Instead of opening the whole backpack, the assistant pulls out a small index card with the key facts. They write the next few words of the story using just this card. This is super fast and requires very little energy.
  • When it happens: This happens for most of the words in a sentence.
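In code terms, a fast step is just attention restricted to the small retained subset of the key-value cache. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function name and the `cheat_sheet_idx` parameter are our own illustrative choices.

```python
import numpy as np

def fast_step(query, keys, values, cheat_sheet_idx):
    """One 'fast' decode step: attend only to the small retained subset
    of the KV cache (the 'cheat sheet'), not the full history.
    Illustrative sketch; names are not the paper's API."""
    k = keys[cheat_sheet_idx]            # (m, d) with m << total length
    v = values[cheat_sheet_idx]          # (m, d)
    scores = k @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ v                   # attention output from sparse memory
```

Because `m` stays small and fixed between refreshes, the per-token cost no longer grows with the full context length.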

2. The "Slow Steps" (The Deep Dive)

Every now and then, the story hits a major turning point (like a new paragraph, a new scene, or a sentence ending).

  • How it works: The AI stops coasting. It opens the full backpack, reads the whole history again, and figures out what the new most important things are. It then updates its "cheat sheet" with this fresh information.
  • The Analogy: The assistant realizes, "Wait, the cat just jumped off the mat and is now chasing a dog!" The old cheat sheet is outdated. So, they quickly scan the whole story again, update the card with "Dog" and "Chasing," and close the backpack.
  • The Trigger: This happens automatically at natural breaks in the text (like periods or new paragraphs) or if the AI has gone too long without checking.
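Putting the two modes together, the decoding loop can be sketched as below. The boundary tokens, the refresh cap, and the `slow_step`/`fast_step` helpers are all assumptions made for illustration; the paper's actual triggers and interfaces may differ.

```python
BOUNDARY_TOKENS = {".", "!", "?", "\n"}  # assumed sentence/paragraph breaks
MAX_FAST_STEPS = 32                      # assumed cap before a forced refresh

def decode(model, n_tokens):
    """Slow-fast decoding loop (illustrative sketch).

    `model.slow_step` attends to the full KV cache and returns a refreshed
    sparse index set; `model.fast_step` attends only to that set. Both are
    hypothetical helpers standing in for the paper's machinery."""
    cheat_sheet = None
    steps_since_refresh = 0
    out = []
    for _ in range(n_tokens):
        if cheat_sheet is None or steps_since_refresh >= MAX_FAST_STEPS:
            token, cheat_sheet = model.slow_step()   # full-context refresh
            steps_since_refresh = 0
        else:
            token = model.fast_step(cheat_sheet)     # cheap sparse step
            steps_since_refresh += 1
        out.append(token)
        if token in BOUNDARY_TOKENS:                 # semantic boundary:
            cheat_sheet = None                       # force a slow step next
    return out
```

Note that the slow step is amortized: it runs once per sentence-like unit, while every token in between pays only the fast-step cost.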

3. The "Selector" (The Smart Librarian)

The hardest part is deciding what to put on the cheat sheet. If you pick the wrong things, the story makes no sense.

  • How it works: The paper introduces a special tool called a Selector. When the AI does a "Slow Step" and reads the whole story, the Selector acts like a super-smart librarian. It looks at the whole story and says, "Okay, for the next few sentences, we definitely need to remember the 'Cat' and the 'Dog,' but we can forget the 'Red Hat' from three pages ago."
  • The Magic: It uses a clever mathematical trick to mix the fresh reading with some general rules about how stories usually work, ensuring the cheat sheet is perfect for the next batch of "Fast Steps."
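The article describes the Selector only at this high level, so the sketch below is purely a stand-in of our own: a common proxy for "what matters next" is the attention mass each cached position received during the slow step, blended with a fixed recency prior (the "general rules about how stories usually work" from the analogy). The parameter names and the mixing scheme are assumptions, not the paper's method.

```python
import numpy as np

def select_cheat_sheet(attn_weights, keep=64, recency=16, prior_mix=0.3):
    """Pick which KV-cache positions to keep for the next run of fast steps.

    NOT the paper's exact Selector: a simple stand-in that mixes observed
    attention mass with a recency prior."""
    n = attn_weights.shape[-1]
    score = attn_weights.mean(axis=0)          # avg attention per position
    prior = np.zeros(n)
    prior[-recency:] = 1.0 / recency           # always favor recent tokens
    mixed = (1 - prior_mix) * score + prior_mix * prior
    keep = min(keep, n)
    return np.sort(np.argsort(mixed)[-keep:])  # indices of retained entries
```

The design choice here is the blend: relying only on observed attention can overfit to the last few tokens' queries, while the prior keeps a guaranteed window of recent context on the cheat sheet.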

Why This Matters

  • Speed: Because the AI spends 90% of its time using the tiny cheat sheet (Fast Steps) instead of the giant backpack, it can read and write 1.6 to 14 times faster.
  • Quality: Even though it's skipping the heavy lifting most of the time, it checks in often enough (Slow Steps) that it doesn't lose the plot. The quality of the story remains just as good as if it had read everything every time.
  • No Training Needed: The best part? You don't need to teach the AI a new way of thinking. You can just give this "Slow-Fast" rule to any existing AI model, and it works immediately.

In Summary:
Think of Slow-Fast Inference as a runner who usually jogs lightly (Fast Steps) but stops briefly at every mile marker to check the map and adjust their route (Slow Steps). This is much faster than stopping to check the map after every single step, but it ensures they never get lost. This allows AI to handle massive stories and complex reasoning tasks without slowing down to a crawl.