Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

Here is an explanation of the paper "Zipage" using simple language and creative analogies.

The Big Problem: The "Memory Overflow" in AI Brains

Imagine a Large Language Model (like the ones that write code or solve math problems) as a super-smart chef in a busy kitchen.

When this chef is cooking a complex dish (solving a hard math problem), they need to keep a lot of ingredients and notes on the counter (this is called the KV Cache).

The Issue: As the recipe gets longer and more complex, the chef keeps piling more and more notes on the counter. Eventually, the counter gets so cluttered that there's no room left for new ingredients.
The Consequence: The kitchen can only cook one complex dish at a time because the counter is full. If you try to start a second dish, the chef has to stop and clear the whole counter, which takes forever. This limits how many people can order food at once (low concurrency).

The Old Solutions: The "Brute Force" Cleanup

Previous attempts to fix this were like hiring a janitor to throw away random notes to make space.

The Problem: Sometimes the janitor throws away a crucial note (like "add salt at the end"), ruining the dish.
Other attempts: Some tried to throw away whole pages of notes at once. This is too blunt; you might throw away a page that has the most important secret ingredient on it.

The New Solution: Zipage (The "Smart, Compressed Filing System")

The authors of this paper built a new system called Zipage. Think of it as giving the chef a magic, self-compressing filing cabinet and a smart scheduling manager.

Here is how it works, broken down into three main tricks:

1. The "Magic Filing Cabinet" (Compressed PagedAttention)

Instead of a flat counter, the chef uses a filing cabinet with fixed-size drawers (called Pages).

The Rule: The chef may keep only a fixed number of drawers open for any single recipe, say 4.
The Magic: When the 4th drawer fills up, the system doesn't throw notes away at random. It scans them, scores how important each one is, and picks out the "boring" ones (like "stir the pot," which happened 10 minutes ago).
The Result: The most important notes (the "critical path") are packed into the first 3 drawers, the boring ones go in the trash (releasing the memory), and the 4th drawer is left empty for new notes. The space each recipe takes stays bounded, no matter how long the recipe gets.

2. The "Smart Manager" (Hybrid Scheduling)

Imagine a restaurant manager who is very good at juggling.

The Old Way: If the kitchen is full, the manager stops taking new orders until the current ones are done.
The Zipage Way: The manager knows that some orders are quick (short recipes) and some are long (complex math).
- If a "short recipe" comes in, the manager lets it in immediately, even if the kitchen is technically "full," because it won't take up much space.
- If the kitchen truly runs out of room, the manager pauses orders that were let in without a reserved tidy-up spot, so the long recipes that need their notes compressed can keep going.
The Benefit: The kitchen is always full of work, but never overflowing. In the paper's tests, this let the restaurant get through orders more than 2.1 times as fast.

3. The "Shared Blueprint" (Prefix Caching)

Imagine two customers order the exact same appetizer before their main courses.

The Old Way: The chef writes down the appetizer instructions twice, wasting space.
The Zipage Way: The system realizes, "Hey, these two orders share the first 50 steps!" It creates a shared blueprint for those steps. Both orders use the same notes for the beginning, and new notes are written only for the unique parts. This saves a large amount of memory.

The "Async" Trick: Doing Two Things at Once

Usually, the chef has to stop cooking to organize the filing cabinet (compression). This slows everything down.

Zipage's Trick: The filing happens while the cooking continues. Recipes whose notes need organizing are tidied in the background, and every other recipe keeps cooking, so the line never stops. This makes the whole kitchen run much more smoothly.

The Results: What Did They Achieve?

The researchers tested this on difficult math and coding problems (where the "recipes" are very long).

Speed: Zipage was 2.1 to 3.3 times faster than the standard systems, measured in tokens generated per second.
Quality: Despite throwing away old notes, the chef reached about 95% of the accuracy of the system that kept every single note.
Capacity: The kitchen could handle way more orders at once without crashing.

Summary Analogy

If standard AI is a cluttered desk where you can only work on one big project before it gets too messy, Zipage is a smart, automated office assistant.

It throws away the sticky notes you don't need anymore.
It keeps the most important ones in a compact folder.
It lets you work on multiple projects at once by sharing the common parts of the files.
It does all the cleaning while you are still typing, so you never have to stop.

This allows the AI to be faster, handle more users, and solve harder problems without running out of memory.

Here is a detailed technical summary of the paper "Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention."

1. Problem Statement

Large Language Models (LLMs) with reasoning capabilities (e.g., for mathematics and coding) generate extremely long sequences. This creates a critical bottleneck in LLM serving systems:

Memory Bottleneck: The Key-Value (KV) cache required to store attention states grows linearly with sequence length. In long-reasoning scenarios, this exhausts GPU memory, severely limiting the number of concurrent requests a system can handle.
Limitations of Existing Solutions:
- Full KV Cache: Standard engines (like vLLM) run out of memory quickly as sequence lengths increase, forcing a reduction in batch size and concurrency.
- Naive Eviction Methods: Existing token-wise eviction methods (e.g., SnapKV) often lack integration with modern inference engines, failing to support continuous batching or prefix caching, leading to lower throughput.
- Coarse-Grained Eviction: Methods that evict entire pages (blocks) risk losing critical information, degrading model accuracy.
- Input-Only Compression: Some methods (e.g., KV-Compress) only compress the input, disrupting prefix caching and increasing prefilling costs.
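
To make the linear growth of the KV cache concrete, the sketch below estimates its size per request. The model dimensions (36 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(seq_len, num_layers=36, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes of KV cache held by one request.

    Every token stores one key and one value vector per layer and per KV head.
    All default dimensions are assumed, illustrative values.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return seq_len * per_token


# Memory per request grows in direct proportion to the sequence length.
for tokens in (2_048, 16_384, 32_768):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB per request")
```

Under these assumptions a single long reasoning trace takes several GiB, so only a handful of such requests fit on one GPU at the same time.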

2. Methodology: Compressed PagedAttention & Zipage

The authors propose Compressed PagedAttention, a novel KV cache management strategy that integrates PagedAttention with flexible, token-wise eviction. Based on this, they built Zipage, a high-concurrency LLM inference engine.

Core Components:

Compressed PagedAttention Mechanism:
- Block Capping: Each request is capped at a maximum number of blocks ($N_{max}$).
- Token-wise Eviction: When a request exceeds $N_{max}$ blocks during decoding, a compression operation is triggered.
- Selection Strategy: The system calculates importance scores for KV cache entries using a combination of:
  - Attention Scores: Relevance between the current query and past keys.
  - Global Scores: Historical attention aggregation (from G-KV) to capture long-term importance.
  - Redundancy Scores: Identification of redundant tokens (from R-KV), optimized via a new "Lightning Redundancy Score" to reduce computational complexity from $O(N^2 \times b^2)$ to $O(N \times b^2)$.
- Retention: The top-$k$ most important tokens are retained and compacted into the first $N_{max}-1$ blocks. The last block is reserved for new decoding, and excess blocks are released.
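
The block cap and retention step can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: `BLOCK_SIZE`, `N_MAX`, and `maybe_compress` are hypothetical names, token ids stand in for KV entries, and the importance scores are assumed to be already computed (how attention, global, and redundancy scores are combined is not shown).

```python
BLOCK_SIZE = 4   # tokens per block (illustrative; real engines use larger blocks)
N_MAX = 3        # block cap per request (illustrative)


def maybe_compress(tokens, scores):
    """Apply the block cap to one request.

    tokens: cached token ids in position order (stand-ins for KV entries).
    scores: one importance score per token (higher = more important).
    Returns (kept_tokens, blocks_released).
    """
    blocks_used = -(-len(tokens) // BLOCK_SIZE)  # ceiling division
    if blocks_used <= N_MAX:
        return tokens, 0  # still within the cap, nothing to do

    budget = (N_MAX - 1) * BLOCK_SIZE  # tokens that fit in the first N_MAX - 1 blocks
    # Indices of the `budget` highest scores, restored to position order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:budget]
    kept = [tokens[i] for i in sorted(top)]
    # After compaction the request holds N_MAX blocks: N_MAX - 1 full ones
    # plus one empty block reserved for newly decoded tokens.
    return kept, blocks_used - N_MAX


tokens = list(range(13))                       # 13 tokens need 4 blocks, one over the cap
scores = [(i * 7) % 13 for i in range(13)]     # arbitrary distinct scores
kept, released = maybe_compress(tokens, scores)
print(kept, released)
```

Keeping the survivors in position order is a choice made here for readability; the source does not say how retained entries are ordered inside the compacted blocks.
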
Hybrid Scheduling Strategy:
- To maximize GPU utilization, Zipage employs a hybrid scheduling policy that manages Query Slots (memory reserved for compression).
- Constrained Mode: Requests needing compression are assigned query slots.
- Unconstrained Mode: Requests with short sequences (not yet hitting $N_{max}$) can decode without query slots, allowing higher concurrency.
- Preemption: If memory is full, the system preempts requests that do not have query slots assigned, ensuring that requests needing compression can proceed.
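
The preemption rule can be illustrated with a small sketch. The `Request` class and `pick_preemption_victims` function are hypothetical, and the newest-first victim order is an assumption made here; the source only states that requests without query slots are the ones preempted.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    blocks: int              # KV blocks currently held
    has_slot: bool = False   # True once a query slot is assigned


def pick_preemption_victims(running, blocks_needed):
    """Free at least `blocks_needed` blocks by preempting slot-less requests.

    Requests holding a query slot are never chosen, so pending compressions
    can always proceed. Victims are taken newest-first (an assumed tie-break).
    """
    victims, freed = [], 0
    for req in reversed(running):
        if freed >= blocks_needed:
            break
        if not req.has_slot:
            victims.append(req)
            freed += req.blocks
    return victims, freed


running = [
    Request("a", blocks=4, has_slot=True),
    Request("b", blocks=2),
    Request("c", blocks=3, has_slot=True),
    Request("d", blocks=1),
]
victims, freed = pick_preemption_victims(running, blocks_needed=2)
print([v.name for v in victims], freed)
```
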
Shared Prefix Caching:
- Standard compression disrupts shared prefixes (where multiple requests share the same initial tokens).
- Zipage modifies the compression strategy to redirect compressed data to target blocks rather than rearranging existing shared blocks. This preserves the reference count of shared blocks, allowing prefix caching to remain effective even after compression.
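
One plausible reading of this mechanism is sketched below with a toy reference-counted block pool. `BlockPool` and `compress_request` are hypothetical names, and having the compressing request drop its own references afterwards is an assumption; the point illustrated is only that shared blocks are read, never rewritten.

```python
class BlockPool:
    """Toy block allocator with reference counts (all names are hypothetical)."""

    def __init__(self):
        self.data = {}      # block id -> list of tokens
        self.refs = {}      # block id -> number of requests using the block
        self._next_id = 0

    def alloc(self, tokens):
        bid = self._next_id
        self._next_id += 1
        self.data[bid] = list(tokens)
        self.refs[bid] = 1
        return bid

    def share(self, bid):
        self.refs[bid] += 1
        return bid

    def release(self, bid):
        self.refs[bid] -= 1
        if self.refs[bid] == 0:
            del self.data[bid], self.refs[bid]


def compress_request(pool, block_ids, keep, block_size):
    """Write kept tokens into fresh target blocks, then drop the old references.

    Source blocks are only read, so a block shared with another request keeps
    its contents and stays usable as a cached prefix.
    """
    kept = [t for bid in block_ids for t in pool.data[bid] if keep(t)]
    new_ids = [pool.alloc(kept[i:i + block_size])
               for i in range(0, len(kept), block_size)]
    for bid in block_ids:
        pool.release(bid)
    return new_ids


pool = BlockPool()
prefix = pool.alloc([1, 2, 3, 4])                  # prompt prefix shared by A and B
a_tail = pool.alloc([5, 6, 7, 8])
req_a = [prefix, a_tail]
req_b = [pool.share(prefix), pool.alloc([9, 10])]
new_a = compress_request(pool, req_a, keep=lambda t: t % 2 == 0, block_size=4)
print(pool.data[prefix], pool.data[new_a[0]])      # B's prefix block is untouched
```
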
Asynchronous Execution:
- Compression and decoding are executed asynchronously. Requests ready for decoding proceed immediately, while those requiring compression are handled in parallel. This prevents the entire batch from stalling while a few requests undergo compression, significantly improving GPU utilization.
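
The benefit of decoupling the two can be seen in a toy step-by-step simulation. Everything here is an assumption made for illustration: `run` is a hypothetical function, compression is given a fixed cost in engine steps, and each decode step produces one token per request. It is not a model of the actual GPU pipeline.

```python
def run(requests, steps, limit, budget, compress_cost, asynchronous):
    """Count tokens decoded over `steps` engine steps.

    requests: current cache lengths, one per request.
    limit: cache length above which a request must be compressed.
    budget: cache length after compression.
    compress_cost: steps one compression takes (an assumed, fixed cost).
    asynchronous: if False, the whole batch stalls while any request compresses.
    """
    lengths = list(requests)
    busy = [0] * len(lengths)   # remaining compression steps per request
    decoded = 0
    for _ in range(steps):
        # Start compression for any idle request that is over its limit.
        for i, n in enumerate(lengths):
            if busy[i] == 0 and n > limit:
                busy[i] = compress_cost
        stall_all = (not asynchronous) and any(busy)
        for i in range(len(lengths)):
            if busy[i] > 0:
                busy[i] -= 1
                if busy[i] == 0:
                    lengths[i] = budget   # compression finished
            elif not stall_all:
                lengths[i] += 1           # decode one token
                decoded += 1
    return decoded


settings = dict(requests=[10, 0], steps=4, limit=8, budget=4, compress_cost=2)
print("async:", run(asynchronous=True, **settings))
print("sync: ", run(asynchronous=False, **settings))
```

In this toy setup the asynchronous run decodes more tokens in the same number of steps, because the second request keeps decoding while the first is being compressed.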

3. Key Contributions

Compressed PagedAttention: A unified framework combining PagedAttention with fine-grained, token-wise KV cache eviction that supports continuous batching and prefix caching.
Zipage Engine: A high-concurrency inference engine implementing the above method with optimized GPU kernels (using Triton) for scoring and compression.
Algorithmic Optimizations:
- Lightning Redundancy Score: A novel algorithm reducing the complexity of redundancy calculation, making compression feasible in real-time.
- Hybrid Scheduling: A novel scheduling policy that balances concurrency limits with memory constraints to prevent underutilization.
- Prefix-Aware Compression: A mechanism to maintain shared prefix efficiency despite dynamic KV cache eviction.
Asynchronous Pipeline: Decoupling compression from decoding to eliminate latency bottlenecks.

4. Experimental Results

The authors evaluated Zipage on mathematical reasoning (AMC 23, AIME 24) and coding (LiveCodeBench) tasks using Qwen3 and DeepSeek-R1 models.

Throughput (TPS): Zipage achieves a 2.1× to 3.3× speedup in tokens per second compared to standard Full KV inference engines (like vLLM) and significantly outperforms other eviction-based methods.
Accuracy (Pass@1): Zipage maintains ~95% of the performance of a Full KV cache engine on mathematical reasoning tasks when using a KV cache budget of 2048.
Comparison with Baselines:
- Outperforms Nano-vLLM (a lightweight PagedAttention implementation) by maintaining high concurrency without the periodic throughput drops caused by memory preemption.
- Outperforms MorphKV, R-KV, and G-KV, which lack continuous batching support and therefore lose throughput to padding and inefficient batching.
Scalability: The performance gains are consistent across different model sizes (0.6B to 32B).

5. Significance

Enabling Industrial-Scale Reasoning: Zipage addresses the memory bottleneck that currently limits high-concurrency serving of long-reasoning LLM workloads in production environments.
Practicality: Unlike previous research that focuses solely on algorithmic compression, Zipage is a fully integrated inference engine compatible with existing features like prefix caching and continuous batching.
Efficiency: By introducing asynchronous compression and optimized scoring kernels, it demonstrates that high-concurrency serving does not require sacrificing model accuracy or introducing prohibitive latency.
Future Direction: The work paves the way for deploying complex reasoning models in resource-constrained or high-demand scenarios, bridging the gap between theoretical compression methods and practical, high-throughput serving.