SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference
This paper proposes a hybrid methodology that combines theoretical performance modeling with empirical benchmarking to determine the optimal allocation of hardware resources between the prefill and decode stages of disaggregated Large Language Model inference, subject to throughput targets, latency SLOs, and request characteristics such as prompt and output lengths.
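To make the allocation problem concrete, the following is a minimal sketch (not the paper's actual method) of how an analytic throughput model calibrated by benchmarks might drive the resource split. All names, capacity constants, and the utilization-based latency model are illustrative assumptions: per-GPU prefill and decode token rates stand in for benchmarked values, TTFT and TPOT serve as the SLO targets, and a brute-force search finds the cheapest GPU split that satisfies both.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class Workload:
    request_rate: float  # requests per second
    prompt_len: int      # average prompt tokens per request
    output_len: int      # average generated tokens per request

@dataclass
class SLO:
    ttft_ms: float       # time-to-first-token target
    tpot_ms: float       # time-per-output-token target

# Hypothetical per-GPU capacities, as would be obtained from benchmarking:
# prefill throughput in prompt tokens/s, decode throughput in output tokens/s.
# These numbers are placeholders, not measurements.
PREFILL_TOKENS_PER_SEC = 50_000.0
DECODE_TOKENS_PER_SEC = 4_000.0

def prefill_latency_ms(wl: Workload, n_prefill: int) -> float:
    """Estimated TTFT: prompt service time inflated by a crude
    utilization penalty as the offered load nears capacity."""
    capacity = n_prefill * PREFILL_TOKENS_PER_SEC
    offered = wl.request_rate * wl.prompt_len
    if offered >= capacity:
        return float("inf")  # overloaded: SLO unattainable
    service_ms = wl.prompt_len / PREFILL_TOKENS_PER_SEC * 1000.0
    return service_ms / (1.0 - offered / capacity)

def decode_latency_ms(wl: Workload, n_decode: int) -> float:
    """Estimated TPOT: per-token decode time under the offered decode load."""
    capacity = n_decode * DECODE_TOKENS_PER_SEC
    offered = wl.request_rate * wl.output_len
    if offered >= capacity:
        return float("inf")
    per_token_ms = 1000.0 / DECODE_TOKENS_PER_SEC
    return per_token_ms / (1.0 - offered / capacity)

def cheapest_allocation(wl: Workload, slo: SLO, max_gpus: int = 64):
    """Search prefill/decode GPU splits and return the smallest total
    (total, n_prefill, n_decode) meeting both SLOs, or None."""
    best = None
    for n_prefill, n_decode in product(range(1, max_gpus + 1), repeat=2):
        if (prefill_latency_ms(wl, n_prefill) <= slo.ttft_ms
                and decode_latency_ms(wl, n_decode) <= slo.tpot_ms):
            total = n_prefill + n_decode
            if best is None or total < best[0]:
                best = (total, n_prefill, n_decode)
    return best

if __name__ == "__main__":
    wl = Workload(request_rate=20.0, prompt_len=1024, output_len=256)
    slo = SLO(ttft_ms=500.0, tpot_ms=50.0)
    print(cheapest_allocation(wl, slo))  # e.g. (3, 1, 2)
```

In this toy formulation the TTFT and TPOT constraints decouple, so the two stage sizes could be searched independently; the joint brute-force search is kept only as a stand-in for richer models in which the stages interact (e.g., through KV-cache transfer or shared interconnect bandwidth).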