DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving

DOPD is a dynamic LLM inference serving system that maximizes goodput and meets strict SLOs by adaptively adjusting the ratio of prefill to decoding instances based on real-time workload monitoring. It outperforms existing aggregation approaches such as vLLM and disaggregation approaches such as DistServe.

Junhan Liao, Minxian Xu, Wanyi Zheng, Yan Wang, Kejiang Ye, Rajkumar Buyya, Chengzhong Xu

Published 2026-03-10

The Big Picture: The "Two-Chef" Kitchen Problem

Imagine you run a high-end restaurant (a Large Language Model, or LLM) that serves complex dishes. Every order goes through two distinct stages:

  1. The Prep Stage (Prefill): The chef reads the order, gathers ingredients, and chops everything. This is hard work (compute-intensive) but happens quickly once started.
  2. The Cooking Stage (Decoding): The chef actually cooks the meal, plate by plate, serving it to the customer. This is memory-intensive (you need lots of fridge space for ingredients) and takes a long time.

The Old Way (The Single Chef):
In the past, one chef tried to do both prep and cooking for every order at the same time. If a customer ordered a massive banquet (a long prompt), the chef spent all their time chopping, and the kitchen backed up. If they ordered a tiny snack, the chef wasted time setting up a huge station. It was inefficient.

The "PD-Disaggregation" Way (The Two-Chef System):
To fix this, modern systems split the kitchen into two teams:

  • Team A (Prefill Chefs): Only do the chopping and prep. They are fast and strong.
  • Team B (Decoding Chefs): Only do the cooking and serving. They need lots of fridge space.

This sounds great, but it creates a new problem: The Mismatch.

  • If you have too many Prep Chefs and not enough Cooking Chefs, the Prep Chefs finish their work and sit around waiting for the Cooking Chefs to catch up. Wasted money.
  • If you have too many Cooking Chefs and not enough Prep Chefs, the Cooking Chefs stand around with empty hands, waiting for food to arrive. Wasted money.
  • The Chaos: Customers order wildly different things. Some want a 30-second snack; others want a 3-hour feast. If you set your kitchen staff based on the "average" order, you will always be wrong. Short orders get stuck in a long line, and long orders overwhelm the system.

The Solution: DOPD (The Smart Kitchen Manager)

The authors of this paper created a system called DOPD (Dynamic Optimal Prefill/Decoding). Think of DOPD as a super-smart, predictive Kitchen Manager who never sleeps.

Here is how DOPD solves the problems using three main tricks:

1. The Crystal Ball (Predicting the Future)

Most managers just look at what is happening right now. DOPD looks at what happened recently to guess what will happen next.

  • The Analogy: Imagine a weather forecaster who knows that if it rains at 2 PM, it usually pours at 3 PM. DOPD uses a statistical forecasting model called ARIMA (AutoRegressive Integrated Moving Average) to predict the "weather" of your requests.
  • What it does: It guesses: "In the next few minutes, we will get 50 short orders and 5 long orders."
  • The Result: Instead of waiting for a traffic jam to form, DOPD calls in extra Prep Chefs before the rush starts. This prevents the kitchen from ever getting overwhelmed.
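The paper doesn't give pseudocode for the forecaster, so as a rough illustration, here is the simplest ARIMA special case, an AR(1) model, fit to a history of requests per minute. All function names and numbers below are our own, not DOPD's:

```python
# Illustrative stand-in for DOPD's ARIMA forecaster: a minimal AR(1) model
# fit by least squares. The real system fits a full ARIMA model to the
# request-arrival time series; these names and numbers are invented here.

def fit_ar1(series):
    """Fit x[t] = c + phi * x[t-1] by ordinary least squares."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    phi = cov / var
    c = my - phi * mx
    return c, phi

def forecast(series, steps=3):
    """Roll the fitted AR(1) model forward `steps` monitoring windows."""
    c, phi = fit_ar1(series)
    preds, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    return preds

# Requests per minute observed over the last ten monitoring windows.
history = [40, 42, 45, 50, 48, 55, 60, 58, 65, 70]
print(forecast(history))  # rising trend, so the forecast keeps climbing
```

Because the forecast keeps climbing, the scheduler can provision extra prefill instances *before* the rush lands, which is the whole point of the crystal ball.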

2. The Perfect Ratio (Balancing the Team)

DOPD constantly calculates the Golden Ratio of Prep Chefs to Cooking Chefs.

  • The Analogy: If the menu changes from "Salads" (short) to "Steaks" (long), the ratio of chopping knives to frying pans needs to change.
  • What it does: If the system predicts a rush of long requests, it automatically spins up more Prep Chefs. If it's a rush of short requests, it shifts resources to the Cooking side. It ensures that as soon as a Prep Chef finishes chopping, a Cooking Chef is ready to grab the plate. No one is ever standing around idle.
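A toy version of the ratio step (our own simplification, not the paper's exact formulation) makes the idea concrete: given the forecast token rates for each phase, split a fixed GPU pool so that prefill capacity and decode capacity are matched. The throughput numbers are made up for illustration:

```python
# Our simplified sketch of DOPD's ratio balancing: divide `total_gpus`
# between prefill and decode so each side's capacity tracks its forecast
# demand. All parameter values below are invented for illustration.

def split_instances(total_gpus, prefill_tok_rate, decode_tok_rate,
                    prefill_tput_per_gpu, decode_tput_per_gpu):
    """Return (prefill, decode) instance counts proportional to the
    number of GPUs each phase needs to keep up with predicted load."""
    need_p = prefill_tok_rate / prefill_tput_per_gpu   # GPUs prefill needs
    need_d = decode_tok_rate / decode_tput_per_gpu     # GPUs decode needs
    prefill = max(1, round(total_gpus * need_p / (need_p + need_d)))
    prefill = min(prefill, total_gpus - 1)             # keep >= 1 decoder
    return prefill, total_gpus - prefill

# Long-prompt rush: prefill demand dominates, so the pool tilts that way.
print(split_instances(8, prefill_tok_rate=200_000, decode_tok_rate=5_000,
                      prefill_tput_per_gpu=40_000,
                      decode_tput_per_gpu=2_000))  # → (5, 3)
```

Flip the workload to a rush of short prompts (low prefill demand, high decode demand) and the same function shifts almost the whole pool to the decode side, which is exactly the "Salads vs. Steaks" adjustment described above.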

3. The Smart Waiter (Handling Mixed Orders)

This is the most clever part. Sometimes, you get a mix of tiny snacks and huge feasts.

  • The Analogy: Imagine a waiter who knows that if you put a tiny appetizer in the same batch as a giant turkey, the turkey slows down the appetizer.
  • What it does: DOPD has a special rule for Ultra-Short Requests. If a request is tiny (like a 100-word prompt), the system realizes: "Hey, sending this to the Prep Chef takes longer than just cooking it right here!" So, it skips the Prep Chef entirely and cooks it immediately on the Cooking side.
  • The Result: Tiny orders fly through the system instantly, while big orders get the full attention of the Prep team.
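The bypass rule itself is just a threshold check at admission time. A minimal sketch, with a made-up token threshold (the paper's example is a ~100-word prompt; the exact cutoff here is our assumption):

```python
# Sketch of the ultra-short bypass rule. Requests below the threshold skip
# the prefill pool, because handing off the KV cache between instances would
# cost more than just prefilling on the decode instance directly.
# ULTRA_SHORT_TOKENS is an illustrative value, not DOPD's actual cutoff.

ULTRA_SHORT_TOKENS = 128

def route(prompt_tokens):
    """Decide which pool handles the request's prefill phase."""
    if prompt_tokens < ULTRA_SHORT_TOKENS:
        return "decode"   # bypass: prefill + decode on the same instance
    return "prefill"      # normal disaggregated path, then KV handoff

print(route(100))    # → decode  (tiny prompt, bypass)
print(route(2048))   # → prefill (long prompt, full prep pipeline)
```

The design choice mirrors the waiter analogy: the check costs almost nothing per request, and it keeps tiny orders from paying the fixed cost of the two-kitchen handoff.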

Why Does This Matter? (The Results)

The paper tested DOPD against other systems (like vLLM and DistServe) using real-world data from Microsoft Azure. The results were like upgrading from a bicycle to a sports car:

  • 1.5x Faster: The system produced 50% more "good" answers per hour (Goodput).
  • 67% Faster Start: The time it takes to see the first word of an answer (Time-to-First-Token) dropped by two-thirds.
  • 99% Success Rate: Almost every customer got their answer on time (SLO attainment), whereas other systems failed about 20% of the time.
  • Cheaper: Because the system is so efficient, you need fewer expensive GPUs (computer chips) to do the same amount of work.

Summary

DOPD is like a self-driving kitchen that:

  1. Predicts the rush before it happens.
  2. Adjusts the number of chefs instantly to match the demand.
  3. Sorts orders so tiny ones don't get stuck behind huge ones.

It turns a chaotic, inefficient AI service into a smooth, fast, and cost-effective machine, ensuring that when you ask an AI a question, it answers you quickly and reliably, no matter how busy the server is.