Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

The paper introduces Zipage, an LLM inference engine that uses Compressed PagedAttention to combine token-wise KV cache eviction with PagedAttention. It achieves a speedup of over 2.1× in high-concurrency reasoning tasks while maintaining approximately 95% of the performance of full KV inference.

Mengqi Liao, Lu Wang, Chaoyun Zhang, Bo Qiao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan

Published Wed, 11 Ma

Here is an explanation of the paper "Zipage" using simple language and creative analogies.

The Big Problem: The "Memory Overflow" in AI Brains

Imagine a Large Language Model (like the ones that write code or solve math problems) as a super-smart chef in a busy kitchen.

When this chef is cooking a complex dish (solving a hard math problem), they need to keep a lot of ingredients and notes on the counter (this is called the KV Cache).

  • The Issue: As the recipe gets longer and more complex, the chef keeps piling more and more notes on the counter. Eventually, the counter gets so cluttered that there's no room left for new ingredients.
  • The Consequence: The kitchen can only cook one complex dish at a time because the counter is full. If you try to start a second dish, the chef has to stop and clear the whole counter, which takes forever. This limits how many people can order food at once (low concurrency).
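
To put rough numbers on the cluttered counter, here is a back-of-the-envelope sketch of how KV cache size limits concurrency. The formula (keys plus values, for every layer and every token) is the standard one for a transformer KV cache. The model shape and the 16 GiB budget are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one request: keys + values, per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

def max_concurrent_requests(memory_budget_bytes, tokens_per_request, **model):
    """How many requests of a given length fit in the memory budget."""
    return memory_budget_bytes // kv_cache_bytes(tokens_per_request, **model)

# Hypothetical model shape and memory budget (assumptions, not from the paper).
model = dict(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 16 * 1024**3  # 16 GiB reserved for the KV cache

for tokens in (2048, 16384, 32768):
    print(tokens, "tokens per request ->",
          max_concurrent_requests(budget, tokens, **model), "concurrent requests")
```

Under these assumptions, each token costs 128 KiB of cache, so the number of requests that fit shrinks in direct proportion to how long each one's reasoning trace grows.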

The Old Solutions: The "Brute Force" Cleanup

Previous attempts to fix this were like hiring a janitor to throw away random notes to make space.

  • The Problem: Sometimes the janitor throws away a crucial note (like "add salt at the end"), ruining the dish.
  • Other attempts: Some tried to throw away whole pages of notes at once. This is too blunt; you might throw away a page that has the most important secret ingredient on it.

The New Solution: Zipage (The "Smart, Compressed Filing System")

The authors of this paper built a new system called Zipage. Think of it as giving the chef a magic, self-compressing filing cabinet and a smart scheduling manager.

Here is how it works, broken down into three main tricks:

1. The "Magic Filing Cabinet" (Compressed PagedAttention)

Instead of a flat counter, the chef uses a filing cabinet with fixed-size drawers (called Pages).

  • The Rule: The chef is only allowed to have 4 drawers open for any single recipe at a time.
  • The Magic: When the 4th drawer gets full, the system doesn't just throw things away. It instantly scans the notes, identifies the "boring" ones (like "stir the pot" which happened 10 minutes ago), and compresses them.
  • The Result: It keeps the most important notes (the "critical path") in the first 3 drawers and throws the rest into a "trash can" (releasing the memory). This keeps the counter size constant, no matter how long the recipe gets.
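
The drawer rule above can be sketched in a few lines. This is a toy model, not the engine's actual implementation: the block size, the function names, and the idea that each token arrives with a fixed importance score are all assumptions made for illustration. A real system would have to derive importance from the model itself, which this sketch does not attempt.

```python
BLOCK_SIZE = 4   # tokens per block; a small hypothetical value for readability
MAX_BLOCKS = 4   # per-request block budget, matching the "4 drawers" rule above

def append_token(cache, token, score):
    """Add one (token, score) entry; compress when the block budget fills up.

    `score` is a hypothetical importance value supplied by the caller.
    """
    cache = cache + [(token, score)]
    if len(cache) == MAX_BLOCKS * BLOCK_SIZE:
        keep = (MAX_BLOCKS - 1) * BLOCK_SIZE
        # Pick the highest-scoring entries, then restore their original order.
        top = sorted(range(len(cache)), key=lambda i: cache[i][1], reverse=True)[:keep]
        cache = [cache[i] for i in sorted(top)]
    return cache

def blocks_in_use(cache):
    """Number of blocks needed to hold the cache (ceiling division)."""
    return -(-len(cache) // BLOCK_SIZE)

cache = []
for t in range(100):  # a long "recipe" of 100 tokens
    cache = append_token(cache, t, score=(t * 7) % 10)
print(len(cache), "tokens kept in", blocks_in_use(cache), "blocks")
```

However many tokens are generated, the request never holds more than `MAX_BLOCKS` blocks, because each time the last block fills, the entries are squeezed back into `MAX_BLOCKS - 1` blocks and one block is freed.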

2. The "Smart Manager" (Hybrid Scheduling)

Imagine a restaurant manager who is very good at juggling.

  • The Old Way: If the kitchen is full, the manager stops taking new orders until the current ones are done.
  • The Zipage Way: The manager knows that some orders are quick (short recipes) and some are long (complex math).
    • If a "short recipe" comes in, the manager lets it in immediately, even if the kitchen is technically "full," because it won't take up much space.
    • If a "long recipe" needs more space, the manager temporarily pauses it to let the short ones finish, then resumes the long one.
    • The Benefit: The kitchen is always full of work, but never overflowing. In the paper's high-concurrency reasoning tests, this contributes to an overall speedup of more than 2.1 times.
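
A toy admission step shows why a per-request cap makes this kind of juggling possible. This is only a sketch of the analogy, not the paper's scheduler: the function name, the FIFO policy, and the block counts are assumptions, and pausing and resuming long requests is not modeled here.

```python
from collections import deque

def schedule_step(waiting, running, free_blocks, block_cap):
    """Admit waiting requests in FIFO order while their capped demand fits.

    waiting: deque of (request_id, blocks_needed_without_compression)
    running: dict mapping request_id -> blocks currently reserved
    Returns the number of free blocks left after admission.
    """
    while waiting:
        req_id, needed = waiting[0]
        demand = min(needed, block_cap)  # compression caps any request at block_cap
        if demand > free_blocks:
            break                        # kitchen is full for now
        waiting.popleft()
        running[req_id] = demand
        free_blocks -= demand
    return free_blocks

waiting = deque([("long-math", 100), ("short-chat", 1), ("long-code", 50)])
running = {}
free = schedule_step(waiting, running, free_blocks=8, block_cap=4)
print(running, "free blocks:", free, "still waiting:", list(waiting))
```

Without the cap, the first request alone would ask for 100 blocks and nothing could start in an 8-block kitchen. With the cap, a long request and a short one run side by side, and the third waits for space.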

3. The "Shared Blueprint" (Prefix Caching)

Imagine two customers order the exact same appetizer before their main courses.

  • The Old Way: The chef writes down the appetizer instructions twice, wasting space.
  • The Zipage Way: The system realizes, "Hey, these two orders share the first 50 steps!" It creates a shared blueprint for those steps. Both orders use the same notes for the beginning, and new notes are written only for the parts that differ. This saves a large amount of memory when many requests start the same way.
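
A common way to implement a shared blueprint is to hash each full block of prompt tokens together with everything before it, so two requests land on the same block only when their prefixes match exactly. The sketch below follows that general idea. The class name, block size, and hashing scheme are assumptions for illustration, and it does not show how Zipage combines shared blocks with compression.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; a small hypothetical value for readability

class PrefixCache:
    """Toy prefix cache: maps a chained block hash to a reference count."""

    def __init__(self):
        self.blocks = {}  # block hash -> number of requests sharing it

    def _block_keys(self, token_ids):
        keys, prev = [], ""
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # only full blocks are shared
        for start in range(0, full, BLOCK_SIZE):
            chunk = token_ids[start:start + BLOCK_SIZE]
            # Chain in the previous hash so a block's key depends on its whole prefix.
            text = prev + "|" + ",".join(map(str, chunk))
            prev = hashlib.sha256(text.encode()).hexdigest()
            keys.append(prev)
        return keys

    def register(self, token_ids):
        """Return (reused, created) block counts for one prompt."""
        reused = created = 0
        for key in self._block_keys(token_ids):
            if key in self.blocks:
                self.blocks[key] += 1
                reused += 1
            else:
                self.blocks[key] = 1
                created += 1
        return reused, created

cache = PrefixCache()
order_a = list(range(12))                     # 3 full blocks
order_b = list(range(8)) + [99, 98, 97, 96]   # same first 2 blocks, different third
print(cache.register(order_a))
print(cache.register(order_b))
```

The second order reuses the two blocks it has in common with the first and only pays for the one block that differs.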

The "Async" Trick: Doing Two Things at Once

Usually, the chef has to stop cooking to organize the filing cabinet (compression). This slows everything down.

  • Zipage's Trick: The chef organizes the filing cabinet while the other chefs are still cooking. They work in parallel. The chef organizing the files doesn't stop the cooking line. This makes the whole kitchen run much smoother.
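
The overlap can be sketched with a background worker: compression for one request is handed off, and the main loop keeps producing tokens for the others until the result is needed. Everything here is a stand-in. The function names are hypothetical, `decode_step` just appends a counter, and a Python thread only illustrates the control flow, not the mechanism an actual GPU engine would use to run the two kinds of work side by side.

```python
from concurrent.futures import ThreadPoolExecutor

def compress(scored_tokens, keep):
    """Keep the `keep` highest-scoring (token, score) entries in original order."""
    top = sorted(range(len(scored_tokens)),
                 key=lambda i: scored_tokens[i][1], reverse=True)[:keep]
    return [scored_tokens[i] for i in sorted(top)]

def decode_step(request_state):
    """Stand-in for generating one token for a request."""
    request_state.append(len(request_state))

def run(full_cache, other_requests, keep, steps):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(compress, full_cache, keep)  # compression in the background
        for _ in range(steps):
            for state in other_requests:
                decode_step(state)                         # other requests keep decoding
        return pending.result()                            # collect when it is needed

full = [(t, t % 3) for t in range(8)]
others = [[], []]
compressed = run(full, others, keep=4, steps=3)
print(compressed, others)
```

The design point is that the cooking loop never waits on the filing: it only asks for the compressed result at the moment the request needs it again.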

The Results: What Did They Achieve?

The researchers tested this on difficult math and coding problems (where the "recipes" are very long).

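  • Speed: Zipage was over 2.1 times faster than inference with the full KV cache on high-concurrency reasoning tasks.
  • Quality: Despite throwing away old notes, the chef retained approximately 95% of the performance of the system that kept every single note (full KV inference).
  • Capacity: Because each recipe's notes stay within a fixed number of drawers, the kitchen could keep more orders in progress at once without running out of space.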

Summary Analogy

If standard AI is a cluttered desk where you can only work on one big project before it gets too messy, Zipage is a smart, automated office assistant.

  1. It throws away the sticky notes you don't need anymore.
  2. It keeps the most important ones in a compact folder.
  3. It lets you work on multiple projects at once by sharing the common parts of the files.
  4. It does all the cleaning while you are still typing, so you never have to stop.

This allows the AI to be faster, handle more users, and solve harder problems without running out of memory.