Here is an explanation of the paper using simple language and everyday analogies.
The Big Picture: The "Assembly Line" Problem
Imagine you run a massive, high-end bakery that makes custom cakes (these are your Large Language Models or LLMs).
Traditionally, you had one baker who did everything:
- Reading the Order (Prefill): They read the customer's long, complex request (e.g., "Write a 50-page story about a dragon"). This takes a lot of brainpower but happens fast.
- Baking the Cake (Decode): They then write the story word-by-word. This is slower and requires a steady, rhythmic pace.
The Problem: When you have a rush of customers, the baker gets stuck. If they are busy reading a new order, they can't write the next word for the previous customer. If they are busy writing, they can't read new orders. This causes long waits before the first word appears (Time To First Token, or TTFT) and slow typing speeds for the rest of the story (Time Per Output Token, or TPOT).
The Solution (P/D Disaggregation):
To fix this, you split the bakery into two separate stations:
- Station A (Prefill): Specialized bakers who only read orders and prepare the dough.
- Station B (Decode): Specialized bakers who only write the story word-by-word.
- The Conveyor Belt: Once Station A is done, they pass the "dough" (the processed context, known as the KV cache) to Station B.
This is great! But now you have a new, tricky question: How many bakers do I need at Station A versus Station B?
- If you hire too many Station A bakers but not enough Station B bakers, the dough piles up at Station B, and customers get their stories typed out painfully slowly.
- If you hire too many Station B bakers but not enough Station A bakers, the writing team sits idle while new orders back up at Station A, and customers wait too long for their story to start.
The Paper's Goal: The "Goldilocks" Calculator
This paper by Kingsoft Cloud provides a mathematical recipe to figure out the exact number of bakers (GPUs) you need for each station to keep costs low while making sure customers are happy (meeting SLOs or Service Level Objectives).
They don't just guess; they use a mix of Theory and Real-World Testing.
Step 1: The Theory (The Traffic Light Model)
The authors realized that the "Reading" station (Prefill) acts like a busy intersection.
- The Analogy: Imagine a single-lane road (the Prefill GPU). Cars (requests) arrive at random times.
- The Constraint: You promise every driver they will get through the intersection in under 2 seconds (TTFT).
- The Math: If you let the road run at 100% capacity, traffic jams happen, and the 2-second promise breaks. You have to slow down the flow slightly to keep the wait time low.
The paper uses a classic queueing model called M/M/1 (think of it as a traffic-flow calculator) to answer: "If I want the wait time to be under 2 seconds, how many cars can I actually let through per minute?"
This gives them the effective speed of the Prefill station.
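The M/M/1 bound can be sketched in a few lines. This is a minimal illustration, not the paper's exact derivation: it assumes TTFT behaves like the mean sojourn time of an M/M/1 queue, W = 1/(mu - lam), where mu is the GPU's raw prefill rate and lam is the arrival rate we admit. All numbers are made up for illustration.

```python
# Hedged sketch: how an M/M/1 queue caps the usable Prefill rate.
# Assumption: mean TTFT ~ M/M/1 sojourn time W = 1/(mu - lam), with
# mu = raw prefill rate (requests/sec per GPU) and lam = admitted
# arrival rate. Solving W <= SLO for lam gives lam <= mu - 1/SLO.

def max_arrival_rate(mu: float, ttft_slo: float) -> float:
    """Largest arrival rate lam such that mean TTFT = 1/(mu - lam) <= ttft_slo."""
    lam = mu - 1.0 / ttft_slo
    return max(lam, 0.0)  # can't admit a negative rate

mu = 5.0   # raw capacity: 5 prefill requests/sec per GPU (invented number)
slo = 2.0  # promise: mean TTFT under 2 seconds
lam = max_arrival_rate(mu, slo)
print(f"effective prefill rate: {lam:.2f} req/s ({lam / mu:.0%} of raw capacity)")
# -> effective prefill rate: 4.50 req/s (90% of raw capacity)
```

Note the key consequence: to keep the wait-time promise you must run the station below 100% utilization, which is exactly why the "effective" rate is lower than the raw rate.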
Step 2: The Real-World Test (The Running Track)
For the "Writing" station (Decode), the math is a bit different. It's like a runner on a track.
- The Analogy: A runner can carry one water bottle (batch size) or ten.
- The Trade-off: Carrying 10 bottles is more efficient (higher throughput), but it makes the runner slower and more tired (higher TPOT or Time Per Output Token).
- The Test: The team ran experiments to see exactly how many bottles a runner can carry before they start running too slowly to meet the customer's speed requirement.
This gives them the effective speed of the Decode station.
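The batch-size experiment boils down to a table lookup. Here is a minimal sketch: the batch-size-to-TPOT table below is invented for illustration; in practice it comes from benchmarking the actual model on the actual GPU.

```python
# Hedged sketch: picking the Decode batch size from profiling data.
# The measured_tpot table is invented; real values come from benchmarks.

measured_tpot = {1: 0.015, 4: 0.020, 8: 0.030, 16: 0.045, 32: 0.080}  # sec/token

def best_batch(tpot_slo: float) -> tuple[int, float]:
    """Largest profiled batch whose TPOT meets the SLO, plus its throughput."""
    ok = [b for b, t in measured_tpot.items() if t <= tpot_slo]
    b = max(ok)
    return b, b / measured_tpot[b]  # tokens/sec for the whole GPU

batch, throughput = best_batch(tpot_slo=0.05)
print(f"batch={batch}, decode throughput={throughput:.0f} tokens/s")
```

The trade-off from the analogy shows up directly in the table: larger batches raise per-GPU throughput (b / TPOT) but also raise TPOT, so the SLO puts a ceiling on how many "bottles" the runner can carry.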
Step 3: The Final Recipe
Once they know:
- How fast Station A can work while keeping wait times low.
- How fast Station B can work while keeping typing speeds high.
- How long the average "order" is (input length) and how long the "story" is (output length).
They plug these numbers into a simple formula to get the Perfect Ratio.
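The final recipe is simple arithmetic once the two effective speeds are known. This sketch uses invented workload numbers (not the paper's) that happen to land on a 3P4D split like the example below: prefill GPUs cover the request rate, decode GPUs cover the total token-generation rate, and both counts round up.

```python
import math

# Hedged sketch of the final sizing step. The per-GPU rates would come from
# Steps 1 and 2; the workload numbers here are invented for illustration.

def gpu_counts(req_rate, out_len, prefill_rate, decode_tps):
    """(num_prefill, num_decode) GPUs needed to sustain req_rate requests/sec."""
    n_p = math.ceil(req_rate / prefill_rate)           # Station A: reading
    n_d = math.ceil(req_rate * out_len / decode_tps)   # Station B: writing
    return n_p, n_d

# e.g. 10 req/s, 200-token answers, 4.5 req/s per prefill GPU,
# 600 tokens/s per decode GPU:
print(gpu_counts(10, 200, 4.5, 600))  # -> (3, 4): a "3P4D" setup
```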
Example from the paper:
They found that for a specific workload, the perfect setup was 3 Prefill bakers and 4 Decode bakers (a 3P4D setup).
- If they tried 3 and 3, the writing team would be the bottleneck, and customers would get angry at the slow typing speed.
- If they tried 3 and 5, they would have wasted money on an extra baker who just stood around doing nothing.
Why This Matters
Before this paper, companies had to guess or run expensive simulations to figure out their hardware needs. They might buy too much expensive hardware (wasting money) or too little (making customers unhappy).
This paper gives them a calculator:
- Tell me your speed requirements (SLOs).
- Tell me your average order size.
- I will tell you exactly how many GPUs to buy for the "Reading" team and the "Writing" team to get the most bang for your buck.
Summary
- The Problem: Splitting AI work into "Reading" and "Writing" is efficient, but hard to balance.
- The Solution: A hybrid method using traffic math (Queuing Theory) for the "Reading" part and real-world running tests for the "Writing" part.
- The Result: A precise formula to tell companies exactly how many computers (GPUs) to buy to save money while keeping users happy.