Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

This paper introduces the Longest Stable Prefix (LSP) scheduler, a training-free inference paradigm for Diffusion Language Models. By absorbing contiguous prefixes instead of scattered tokens, LSP resolves KV cache fragmentation, improves hardware efficiency, and accelerates generation by up to 3.4x without compromising output quality.

Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan

Published 2026-03-06
📖 4 min read · ☕ Coffee break read

Imagine you are trying to write a long story, but you have a magical assistant (the AI) who can look at the whole story at once, rather than writing it one word at a time. This is how Diffusion Language Models (DLMs) work. They are like a sculptor who sees the finished statue shrouded in fog and gradually clears the fog away to reveal it.

However, there's a problem. The current way these models "clear the fog" is messy and slow. This paper introduces a new, much faster way to do it called LSP (Longest Stable Prefix).

Here is the breakdown using simple analogies:

1. The Problem: The "Scattered Acceptance" Mess

Imagine you are building a wall with bricks.

  • The Old Way (Scattered Acceptance): You look at the whole wall. You see that Brick #3 is solid, so you glue it down. Then you see Brick #10 is solid, so you glue that down. Then Brick #7.
  • The Result: You have a wall with solid bricks scattered everywhere, separated by gaps of "maybe" bricks.
  • Why it's bad:
    1. Confusion: Every time you glue a brick, the model has to check how it fits with the other scattered bricks. It's like trying to build a puzzle where the pieces keep moving around.
    2. Memory Chaos: In computer terms, this scatters the model's "memory" (called the KV cache). Instead of a neat, continuous line of memory, it's a bunch of tiny, disconnected fragments. This makes the computer's brain (the processor) work much harder to find the pieces it needs, slowing everything down.
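The fragmentation problem above can be sketched in a few lines of Python (a toy illustration, not the paper's code; `contiguous_runs` is a made-up helper):

```python
# Toy illustration (not from the paper) of why scattered acceptance
# fragments the KV cache while prefix acceptance keeps it contiguous.

def contiguous_runs(positions):
    """Group sorted accepted token positions into contiguous runs."""
    runs = []
    for p in sorted(positions):
        if runs and p == runs[-1][-1] + 1:
            runs[-1].append(p)   # extends the current run
        else:
            runs.append([p])     # starts a new fragment
    return runs

scattered = {3, 7, 10}    # "glue brick #3, then #10, then #7"
prefix    = {0, 1, 2, 3}  # LSP-style: one solid block from the start

print(len(contiguous_runs(scattered)))  # 3 fragments to track
print(len(contiguous_runs(prefix)))     # 1 contiguous block
```

Each extra fragment is another piece of bookkeeping the attention kernel has to chase, which is exactly the "memory chaos" described above.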

2. The Solution: The "Longest Stable Prefix" (LSP)

Now, imagine a smarter builder.

  • The New Way (LSP): The builder looks at the wall and says, "I see a solid, unbroken section starting from the very beginning. Let's lock that whole section down at once."
  • The Process:
    1. Look: The model checks the beginning of the sentence.
    2. Judge: It asks, "Is this first word solid? Yes. Is the second? Yes. Is the third? Yes. Is the fourth? Hmm, maybe not yet."
    3. Snap: It could lock in the first three words. But wait! The third word falls in the middle of a clause. Rather than commit a fragment mid-thought, the model aligns the locked section's boundary with the nearest punctuation mark (like a period or comma), so the committed chunk ends at a natural break.
    4. Commit: It permanently locks in that whole chunk (the "Prefix") and moves on to the rest.
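The four steps above can be sketched roughly as follows. This is a hypothetical Python sketch: the function name, the confidence threshold, and the punctuation-based snapping rule (here, snapping back to the most recent clause break) are illustrative guesses, not the paper's actual implementation.

```python
# Hypothetical sketch of the LSP Look/Judge/Snap/Commit steps.
# Names, threshold, and the snap rule are illustrative, not the paper's API.

PUNCT = {".", ",", ";", ":", "!", "?"}

def longest_stable_prefix(tokens, confidences, threshold=0.9):
    """Return how many leading tokens to commit permanently.

    1. Look/Judge: walk from position 0 while each token's confidence
       clears the threshold.
    2. Snap: pull the boundary back to the most recent punctuation
       token, so the committed chunk ends at a clause break.
    """
    # Look/Judge: longest run of confident tokens from the start.
    k = 0
    while k < len(tokens) and confidences[k] >= threshold:
        k += 1

    # Snap: last clause boundary inside that run (0 if none yet).
    snap = 0
    for i in range(k):
        if tokens[i] in PUNCT:
            snap = i + 1  # include the punctuation token itself
    return snap

tokens = ["The", "cat", "sat", ".", "Then", "it", "slept"]
conf   = [0.99, 0.98, 0.95, 0.97, 0.93, 0.40, 0.30]
print(longest_stable_prefix(tokens, conf))  # 4 -> commits "The cat sat ."
```

In this toy run, five tokens clear the threshold, but the boundary snaps back to the period, so only the complete clause "The cat sat ." is committed; the rest stays open for further refinement.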

3. Why This is a Game-Changer

The paper argues that this "Prefix-First" approach is like switching from a chaotic construction site to a highly organized assembly line.

  • Memory Efficiency (The Library Analogy):

    • Old Way: Imagine a library where books are scattered randomly on the floor. To read the next chapter, the librarian has to run all over the room to find them.
    • New Way: The books are stacked perfectly in order on a shelf. The librarian just grabs the next stack. This is what LSP does to the computer's memory. It keeps the "locked" part of the text in one neat, continuous block, making it incredibly fast to access.
  • Fewer Mistakes (The "Repair" Analogy):

    • Old Way: Because the model locks down scattered pieces, it often locks in a word that turns out to be wrong later. It has to go back, un-glue the brick, and fix it. This "repair work" happens over and over.
    • New Way: By locking down a whole coherent chunk (like a full sentence) at once, the model is less likely to make mistakes that need fixing later. It stabilizes the story early on.
  • Speed:
    Because the computer doesn't have to jump around in its memory or constantly fix mistakes, the whole process speeds up dramatically. The paper shows this can make the AI 3.4 times faster without making the answers worse. In fact, for math and coding tasks, the answers sometimes get better because the model isn't distracted by fixing its own messy work.
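The memory-efficiency point from the library analogy can be made concrete with a toy contrast (illustrative only; real inference engines gather KV entries as tensors on an accelerator, not as Python lists):

```python
# Toy contrast: fetching cached states when committed tokens are
# scattered vs contiguous. Names like "kv_i" are illustrative stand-ins.

cache = [f"kv_{i}" for i in range(12)]  # one entry per generated token

# Scattered commits: a gather over arbitrary indices (librarian
# running around the room).
scattered_idx = [3, 7, 10]
gathered = [cache[i] for i in scattered_idx]

# Prefix commit: one contiguous slice (books already in order on
# the shelf) -- the access pattern LSP enables.
prefix_len = 4
sliced = cache[:prefix_len]

print(gathered)  # ['kv_3', 'kv_7', 'kv_10']
print(sliced)    # ['kv_0', 'kv_1', 'kv_2', 'kv_3']
```

A contiguous slice maps to one sequential read of memory, whereas a gather touches arbitrary locations; hardware strongly favors the former.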

Summary

The paper says: Stop trying to glue down individual, scattered words. Instead, find the longest, solid chunk of text starting from the beginning, lock it down neatly, and move forward.

It turns a chaotic, slow, and error-prone process into a smooth, fast, and organized one, unlocking the true speed potential of these powerful AI models.