Here is an explanation of the paper using simple language and everyday analogies.
The Big Picture: The "Assembly Line" Problem
Imagine you run a massive, high-end bakery that makes custom cakes (these are your Large Language Models or LLMs).
Traditionally, you had one baker who did everything:
- Reading the Order (Prefill): They read the customer's long, complex request (e.g., "Write a 50-page story about a dragon"). This takes a lot of brainpower but happens fast.
- Baking the Cake (Decode): They then write the story word-by-word. This is slower and requires a steady, rhythmic pace.
The Problem: When you have a rush of customers, the baker gets stuck. If they are busy reading a new order, they can't write the next word for the previous customer. If they are busy writing, they can't read new orders. This causes long waits before the first word appears (Time To First Token, or TTFT) and slow typing speeds for the rest of the story (Time Per Output Token, or TPOT).
The Solution (P/D Disaggregation):
To fix this, you split the bakery into two separate stations:
- Station A (Prefill): Specialized bakers who only read orders and prepare the dough.
- Station B (Decode): Specialized bakers who only write the story word-by-word.
- The Conveyor Belt: Once Station A is done, they pass the "dough" (the processed context, known as the KV cache) to Station B.
This is great! But now you have a new, tricky question: How many bakers do I need at Station A versus Station B?
- If you hire too many Station A bakers but not enough Station B bakers, the dough piles up at Station B, and customers get their stories typed out painfully slowly.
- If you hire too many Station B bakers but not enough Station A bakers, the writing team sits idle while new orders back up at Station A, and customers wait too long for their story to start.
The Paper's Goal: The "Goldilocks" Calculator
This paper by Kingsoft Cloud provides a mathematical recipe to figure out the exact number of bakers (GPUs) you need for each station to keep costs low while making sure customers are happy (meeting SLOs or Service Level Objectives).
They don't just guess; they use a mix of Theory and Real-World Testing.
Step 1: The Theory (The Traffic Light Model)
The authors realized that the "Reading" station (Prefill) acts like a busy intersection.
- The Analogy: Imagine a single-lane road (the Prefill GPU). Cars (requests) arrive at random times.
- The Constraint: You promise every driver they will get through the intersection in under 2 seconds (TTFT).
- The Math: If you let the road run at 100% capacity, traffic jams happen, and the 2-second promise breaks. You have to slow down the flow slightly to keep the wait time low.
The paper uses a classic queueing model called M/M/1 (think of it as a traffic-flow calculator) to answer: "If I want the wait time to be under 2 seconds, how many cars can I actually let through per minute?"
This gives them the effective speed of the Prefill station.
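The M/M/1 bound can be sketched in a few lines. This is a minimal illustration, not the paper's exact derivation: it assumes TTFT behaves like the mean sojourn time of an M/M/1 queue, W = 1/(mu - lam), where mu is the GPU's raw prefill rate and lam is the arrival rate we admit. All numbers are made up for illustration.

```python
# Hedged sketch: how an M/M/1 queue caps the usable Prefill rate.
# Assumption: mean TTFT ~ M/M/1 sojourn time W = 1/(mu - lam), with
# mu = raw prefill rate (requests/sec per GPU) and lam = admitted
# arrival rate. Solving W <= SLO for lam gives lam <= mu - 1/SLO.

def max_arrival_rate(mu: float, ttft_slo: float) -> float:
    """Largest arrival rate lam such that mean TTFT = 1/(mu - lam) <= ttft_slo."""
    lam = mu - 1.0 / ttft_slo
    return max(lam, 0.0)  # can't admit a negative rate

mu = 5.0   # raw capacity: 5 prefill requests/sec per GPU (invented number)
slo = 2.0  # promise: mean TTFT under 2 seconds
lam = max_arrival_rate(mu, slo)
print(f"effective prefill rate: {lam:.2f} req/s ({lam / mu:.0%} of raw capacity)")
# -> effective prefill rate: 4.50 req/s (90% of raw capacity)
```

Note the key consequence: to keep the wait-time promise you must run the station below 100% utilization, which is exactly why the "effective" rate is lower than the raw rate.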
Step 2: The Real-World Test (The Running Track)
For the "Writing" station (Decode), the math is a bit different. It's like a runner on a track.
- The Analogy: A runner can carry one water bottle (batch size) or ten.
- The Trade-off: Carrying 10 bottles is more efficient (higher throughput), but it makes the runner slower and more tired (higher TPOT or Time Per Output Token).
- The Test: The team ran experiments to see exactly how many bottles a runner can carry before they start running too slowly to meet the customer's speed requirement.
This gives them the effective speed of the Decode station.
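The batch-size experiment boils down to a table lookup. Here is a minimal sketch: the batch-size-to-TPOT table below is invented for illustration; in practice it comes from benchmarking the actual model on the actual GPU.

```python
# Hedged sketch: picking the Decode batch size from profiling data.
# The measured_tpot table is invented; real values come from benchmarks.

measured_tpot = {1: 0.015, 4: 0.020, 8: 0.030, 16: 0.045, 32: 0.080}  # sec/token

def best_batch(tpot_slo: float) -> tuple[int, float]:
    """Largest profiled batch whose TPOT meets the SLO, plus its throughput."""
    ok = [b for b, t in measured_tpot.items() if t <= tpot_slo]
    b = max(ok)
    return b, b / measured_tpot[b]  # tokens/sec for the whole GPU

batch, throughput = best_batch(tpot_slo=0.05)
print(f"batch={batch}, decode throughput={throughput:.0f} tokens/s")
```

The trade-off from the analogy shows up directly in the table: larger batches raise per-GPU throughput (b / TPOT) but also raise TPOT, so the SLO puts a ceiling on how many "bottles" the runner can carry.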
Step 3: The Final Recipe
Once they know:
- How fast Station A can work while keeping wait times low.
- How fast Station B can work while keeping typing speeds high.
- How long the average "order" is (input length) and how long the "story" is (output length).
They plug these numbers into a simple formula to get the Perfect Ratio.
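The final recipe is simple arithmetic once the two effective speeds are known. This sketch uses invented workload numbers (not the paper's) that happen to land on a 3P4D split like the example below: prefill GPUs cover the request rate, decode GPUs cover the total token-generation rate, and both counts round up.

```python
import math

# Hedged sketch of the final sizing step. The per-GPU rates would come from
# Steps 1 and 2; the workload numbers here are invented for illustration.

def gpu_counts(req_rate, out_len, prefill_rate, decode_tps):
    """(num_prefill, num_decode) GPUs needed to sustain req_rate requests/sec."""
    n_p = math.ceil(req_rate / prefill_rate)           # Station A: reading
    n_d = math.ceil(req_rate * out_len / decode_tps)   # Station B: writing
    return n_p, n_d

# e.g. 10 req/s, 200-token answers, 4.5 req/s per prefill GPU,
# 600 tokens/s per decode GPU:
print(gpu_counts(10, 200, 4.5, 600))  # -> (3, 4): a "3P4D" setup
```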
Example from the paper:
They found that for a specific workload, the perfect setup was 3 Prefill bakers and 4 Decode bakers (a 3P4D setup).
- If they tried 3 and 3, the writing team would be the bottleneck, and customers would get angry at the slow typing speed.
- If they tried 3 and 5, they would have wasted money on an extra baker who just stood around doing nothing.
Why This Matters
Before this paper, companies had to guess or run expensive simulations to figure out their hardware needs. They might buy too much expensive hardware (wasting money) or too little (making customers unhappy).
This paper gives them a calculator:
- Tell me your speed requirements (SLOs).
- Tell me your average order size.
- I will tell you exactly how many GPUs to buy for the "Reading" team and the "Writing" team to get the most bang for your buck.
Summary
- The Problem: Splitting AI work into "Reading" and "Writing" is efficient, but hard to balance.
- The Solution: A hybrid method using traffic math (Queuing Theory) for the "Reading" part and real-world running tests for the "Writing" part.
- The Result: A precise formula to tell companies exactly how many computers (GPUs) to buy to save money while keeping users happy.