Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning (Plain-Language Explanation)

Published 2026-06-12

Original authors: Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, Ying Sheng

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you run a massive hotel with thousands of rooms (GPUs) and thousands of different guests (AI models). Some guests are celebrities who want a room 24/7, while others are tourists who only show up for a 10-minute check-in once a day.

The problem is that your hotel is expensive to run. If you give every tourist their own private room just in case they show up, you end up with 90% of your hotel empty and wasted. But if you try to squeeze everyone into one room, you get chaos, and the celebrities get angry because they have to wait.

Prism is a new, smart hotel manager that solves this by using a trick called "Memory Ballooning."

Here is how it works, broken down into simple concepts:

1. The Problem: The "Static Room" Trap

In the old way of running AI, if a model (a guest) was assigned a room, that room was theirs forever, even if they were sleeping (idle).

Space Sharing (The Old Way): You try to put multiple guests in one room. It works great if they are all awake and chatting. But if one guest leaves for a week, their half of the room sits empty, and the other guest can't use it.
Time Sharing (The Other Old Way): You move one guest out to let another in. This works if guests only come at different times. But if two guests keep arriving at the same moment, you have to swap them in and out of the room constantly. Each swap is slow and makes everyone wait (lag), causing them to miss their deadlines.
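
To make the two failure modes concrete, here is a minimal toy model in Python. It is only a sketch: the memory size, the `SWAP_COST_S` value, and both function names are illustrative assumptions, not figures or APIs from the paper.

```python
# Toy model of the two "old ways". All numbers are made up for illustration.

GPU_MEMORY_GB = 80   # hypothetical GPU size
SWAP_COST_S = 5.0    # hypothetical time to swap one model out and another in


def static_space_sharing(demands_gb):
    """Split memory evenly; each model can only use its own fixed slice.

    Returns (served_gb, stranded_gb): memory actually used, and memory that
    sits idle in one slice while another model wanted more.
    """
    slice_gb = GPU_MEMORY_GB / len(demands_gb)
    served = sum(min(d, slice_gb) for d in demands_gb)
    unmet = sum(max(d - slice_gb, 0) for d in demands_gb)
    idle = GPU_MEMORY_GB - served
    return served, min(idle, unmet)


def time_sharing_swap_time(request_order):
    """One model on the GPU at a time; pay SWAP_COST_S whenever the model changes."""
    swaps = sum(1 for prev, cur in zip(request_order, request_order[1:]) if prev != cur)
    return swaps * SWAP_COST_S


# Model A is idle (wants 0 GB) while model B is busy (wants 70 GB).
print(static_space_sharing([0, 70]))                            # (40.0, 30.0)

# Interleaved requests force a swap on almost every request.
print(time_sharing_swap_time(["A", "B", "A", "B", "A", "B"]))   # 25.0
# The same requests arriving in two batches need only one swap.
print(time_sharing_swap_time(["A", "A", "A", "B", "B", "B"]))   # 5.0
```

In the first call, 30 GB sits unused in the idle model's slice while the busy model is short by exactly that amount. In the second, the swap overhead depends entirely on how the requests happen to interleave.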

Real-world AI traffic is messy. Sometimes a group of models all get busy at once, and sometimes they all go quiet. Neither old strategy could handle both patterns on its own.

2. The Solution: The "Ballooning" Trick

Prism introduces a new manager called kvcached (the balloon driver). Think of the GPU memory not as a set of fixed rooms, but as inflatable balloons.

The Elastic Balloon: When a model is busy and needs more space to think, the manager inflates its balloon, stealing empty air from other models that are currently sleeping.
Deflating for Others: When a model goes to sleep, its balloon shrinks, releasing that space so a new, waking-up model can instantly inflate its own balloon.
No Moving Furniture: The best part? The models don't even know this is happening. They just see a room that magically expands and contracts. The manager handles the heavy lifting behind the scenes.
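
The inflate and deflate behavior described above can be sketched as a shared pool of fixed-size pages. This is a hypothetical illustration, not the kvcached interface: the `BalloonPool` class, its method names, and the page counts are all assumptions made for the example.

```python
class BalloonPool:
    """Toy shared-memory pool: models grow and shrink in fixed-size pages.

    Hypothetical sketch of the ballooning idea; not the kvcached API.
    """

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.held = {}  # model name -> pages currently held

    def free_pages(self):
        return self.total_pages - sum(self.held.values())

    def inflate(self, model, pages):
        """Give `model` up to `pages` more pages; return how many were granted."""
        granted = min(pages, self.free_pages())
        self.held[model] = self.held.get(model, 0) + granted
        return granted

    def deflate(self, model, pages):
        """Take back up to `pages` pages from `model`; return how many were released."""
        released = min(pages, self.held.get(model, 0))
        self.held[model] = self.held.get(model, 0) - released
        return released


pool = BalloonPool(total_pages=100)
pool.inflate("model_a", 60)          # A is busy and grows
pool.inflate("model_b", 30)          # B takes some of the rest
print(pool.inflate("model_b", 40))   # only 10 pages are free -> 10

pool.deflate("model_a", 60)          # A goes idle and shrinks to zero
print(pool.inflate("model_b", 40))   # B can now grow -> 40
print(pool.held)                     # {'model_a': 0, 'model_b': 80}
```

The sketch only tracks page counts. It does not model how a real driver keeps the resizing invisible to the models, which is the "No Moving Furniture" part of the design.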

3. The Two-Step Strategy

Prism uses two smart rules to decide who gets the air:

Rule 1: The Global Scheduler (The Hotel Manager): This looks at the whole hotel. It asks, "Which group of guests is currently active?" It then places those active guests in the same room (GPU) so they can share space easily. If a guest is sleeping, it moves them to a storage closet (CPU) to free up space. It constantly rearranges the hotel to make sure no room is overcrowded while another is empty.
Rule 2: The Local Scheduler (The Concierge): This looks at the specific requests coming in right now. If two guests are fighting for the last bit of space, the concierge checks who has the most urgent deadline. It lets the urgent guest in first and tells the less urgent one to wait a moment. This ensures the most important tasks get done on time.
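
The two rules can be sketched with simple stand-in policies. This explanation does not spell out Prism's actual algorithms, so the code below swaps in two textbook heuristics: greedy least-loaded placement for the global step and earliest-deadline-first admission for the local step. The names `place_models`, `Request`, and `admit`, and all the numbers, are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


def place_models(model_load, num_gpus):
    """Global step (toy): offload idle models, then greedily balance the rest.

    Active models are placed heaviest-first, each on the least-loaded GPU.
    Returns (placement, offloaded).
    """
    offloaded = [name for name, load in model_load.items() if load == 0]
    active = [(name, load) for name, load in model_load.items() if load > 0]

    gpus = [(0.0, i, []) for i in range(num_gpus)]  # (load, gpu index, models)
    heapq.heapify(gpus)
    for name, load in sorted(active, key=lambda kv: -kv[1]):
        total, idx, models = heapq.heappop(gpus)
        heapq.heappush(gpus, (total + load, idx, models + [name]))

    placement = {idx: models for _, idx, models in sorted(gpus, key=lambda g: g[1])}
    return placement, offloaded


@dataclass(order=True)
class Request:
    deadline: float                       # the only field used for ordering
    model: str = field(compare=False)
    pages: int = field(compare=False)


def admit(requests, free_pages):
    """Local step (toy): walk requests in deadline order; those that don't fit wait."""
    admitted, waiting = [], []
    for req in sorted(requests):
        if req.pages <= free_pages:
            free_pages -= req.pages
            admitted.append(req.model)
        else:
            waiting.append(req.model)
    return admitted, waiting


print(place_models({"a": 50, "b": 30, "c": 20, "d": 0}, num_gpus=2))
# ({0: ['a'], 1: ['b', 'c']}, ['d'])

print(admit([Request(2.0, "b", 6), Request(0.5, "a", 6), Request(1.0, "c", 3)], free_pages=10))
# (['a', 'c'], ['b'])
```

In the usage example, the idle model `d` is offloaded, the active models are spread so both GPUs carry equal load, and the request with the latest deadline is the one asked to wait.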

4. The Results

The paper tested Prism on real-world data from major AI providers and found:

Faster Service: It met its speed targets (service-level objectives, or SLOs) up to 3.3 times as often as previous methods.
Cheaper Costs: To get the same level of performance, Prism needed half the number of GPUs (or could handle twice as many requests with the same hardware).
Real-World Proof: It has already been deployed in production environments with over 10,000 GPUs, helping companies generate significantly more revenue per GPU by turning wasted "idle" time into billable work.

Summary

Prism is like a smart, elastic hotel manager. Instead of locking guests into fixed rooms or constantly swapping them in and out, it uses inflatable balloons to share space dynamically. It expands space for busy models and shrinks it for sleeping ones, keeping the hotel full, efficient, and fast, and making sure the most urgent guests are served first.
