Published 2026-05-29

Original authors: Boyu Tan, Jiarui Guo, Zongwei Lv, Hanbo Sun, Tong Yang, Kan Liu, Xinfei Shi, Zetao Hu, Yaxin Yu, Chi Zhang, Jianning Zhang, Xi Yang, Wei Zhang, Bo Cai, Silu Zhou, Xiyu Wang, Na He, Yinghao Yu, Wending Bao, Guiyang Huang, Yuxing Yuan, Juncheng Yin, Nan Wang, Lin Yang, Zechao Zhang, Lu Chen, Guoding Li, Tao Lan, Lin Qu

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ✨ This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you run a massive, super-fast library where people come in to ask complex questions. The "Librarian" is a giant AI brain (the Large Language Model) that knows everything. But there's a problem: the library is so big, and the questions are so varied, that the old way of running the library is slow, expensive, and often leaves the Librarian waiting around doing nothing while the shelves are being organized.

The paper introduces RTP-LLM, a brand new, high-speed operating system for this library, built by Alibaba. It's designed to handle millions of customers at once without breaking a sweat. Here is how it works, using simple analogies:

1. The "Prefill" vs. "Decode" Split (The Assembly Line)

In a traditional library, the Librarian has to do two very different jobs at the same time:

Job A (Prefill): Reading a huge, complex question all at once to understand it. This is like a heavy lifting job (compute-heavy).
Job B (Decode): Answering the question one word at a time. This is like a delicate, memory-heavy job where you have to remember everything you just said.

The Problem: Doing both on the same desk causes traffic jams. The heavy lifting slows down the delicate word-by-word answering.
The RTP-LLM Solution: They built a two-station assembly line.

Station 1 (Prefill Node): A dedicated team of strong workers who only read the questions and get them ready.
Station 2 (Decode Node): A dedicated team of fast typists who only write the answers word-by-word.
By separating them, Station 1 can process huge batches of questions quickly, while Station 2 can focus entirely on speed. They don't get in each other's way.

2. The "Smart Filing Cabinet" (KV Cache Management)

When the Librarian answers a question, they have to remember every word they've said so far to keep the conversation making sense. This memory is called the KV Cache.

The Problem: As conversations get longer, this memory pile grows so big it fills up the entire room, forcing the Librarian to throw things away or slow down to find them.
The RTP-LLM Solution: They built a multi-layered, smart filing system.
- Layer 1 (GPU Memory): The most important, frequently used notes are on the Librarian's immediate desk (super fast).
- Layer 2 (Local CPU Memory): Less urgent notes are in a drawer right next to the desk.
- Layer 3 (Remote Memory): Older notes are in a warehouse across the hall, reachable by a high-speed tube (RDMA).
- Layer 4 (Cloud Storage): The deepest archives are in a massive cloud warehouse.
- The Magic Trick: If two people ask similar questions (e.g., "Write a story about a cat"), the system realizes, "Hey, we already have the first half of that story in the filing cabinet!" It reuses those notes instead of rewriting them from scratch. This saves a massive amount of time and space.

3. The "Speed Reading" Shortcut (Speculative Decoding)

Usually, the Librarian writes one word, checks it, writes the next, checks it, and so on. This is slow.
The RTP-LLM Solution: They use a Speed Reader Assistant.

The Assistant guesses the next 3 or 4 words the Librarian might say.
The Librarian quickly checks all those guesses at once.
If the guesses are right, the Librarian accepts all of them instantly. If one is wrong, the Librarian keeps the words before it, corrects the wrong word, and throws away the guesses that came after it.
This turns a slow, step-by-step process into a fast, batched process, making the library much faster.
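The guess-then-check loop can be sketched in a few lines. This is a minimal illustration that assumes greedy decoding; `draft_next` and `target_next` are hypothetical stand-ins for the Assistant and the Librarian, not RTP-LLM APIs. A real engine checks all guesses in one batched pass, whereas this toy calls the target once per position.

```python
from typing import Callable, List

Token = str
NextFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextFn,
                     target_next: NextFn, k: int = 4) -> List[Token]:
    """Return the tokens accepted in one draft-and-verify round (greedy variant)."""
    # 1. The draft model guesses k tokens ahead, one after another.
    guesses: List[Token] = []
    for _ in range(k):
        guesses.append(draft_next(prefix + guesses))

    # 2. The target model checks each guess in order.
    accepted: List[Token] = []
    for guess in guesses:
        correct = target_next(prefix + accepted)
        if guess != correct:
            # First mismatch: keep the target's token, drop the remaining guesses.
            accepted.append(correct)
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the target adds one more token of its own.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models: the target recites a fixed sentence; the draft gets the 4th word wrong.
SENTENCE = ["the", "cat", "sat", "on", "the", "mat"]


def target_next(ctx: List[Token]) -> Token:
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"


def draft_next(ctx: List[Token]) -> Token:
    return "under" if len(ctx) == 3 else target_next(ctx)


print(speculative_step([], draft_next, target_next, k=4))
# ['the', 'cat', 'sat', 'on']
```

Even with one wrong guess, the round yields four words for what would otherwise have been four separate steps.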

4. The "Express Delivery" for New Books (Model Loading)

When the library gets a new, massive encyclopedia (a new AI model with hundreds of billions of pages), loading it onto the shelves used to take hours.
The RTP-LLM Solution: They changed how the books are unpacked.

Old Way: Every worker tried to read the whole book to find their specific pages, causing a traffic jam at the door.
New Way: The books are organized by the order they arrive on the truck. One worker grabs a whole box, reads it, and passes the pages down a human chain to everyone else simultaneously.
This cuts the time to load a giant model from hours down to just minutes, so the library can swap in a new knowledge base without a long shutdown.

5. The "Traffic Cop" (Scheduling)

With millions of people asking questions, some are short ("What's the weather?") and some are huge ("Analyze this 100-page legal document").
The RTP-LLM Solution: A smart Traffic Cop directs the flow.

If a short question comes in, it gets sent to a fast lane.
If a long question comes in, it gets grouped with similar long questions so the workers can handle them efficiently together.
The cop constantly checks who is busy and who is free, ensuring no worker sits idle while others are overwhelmed.

The Results

The paper tested this system against other popular library systems (vLLM and SGLang) using real-world data from Alibaba's own apps (like Taobao and Tmall).

Loading Speed: It loaded new models 4.7 to 6.3 times faster.
Waiting Time: In production traffic, the slowest 5% of waits for a first answer (P95 time to first token) got 35-37% shorter.
Memory: It reused old notes 215% more effectively, which cut the number of prefill machines needed by 75%.
Throughput: On tasks that mix images and text, it handled roughly 1.9 to 2.5 times as many requests.

In short, RTP-LLM is a complete overhaul of how AI libraries are run, turning a chaotic, slow process into a streamlined, high-speed factory that can serve over 100 million users efficiently.

Technical Summary: RTP-LLM

Problem Statement

The rapid scaling of Large Language Models (LLMs) to hundreds of billions of parameters has exposed fundamental bottlenecks in existing inference systems, creating a chasm between model capability and deployability. Traditional systems struggle with four primary challenges:

Underutilized GPUs: The autoregressive nature of LLMs creates sequential bottlenecks where memory-bound decode phases leave compute units idle, while static batching fails to adapt to highly variable request patterns (input/output lengths).
Memory Exhaustion: The Key-Value (KV) cache grows linearly with sequence length and batch size, becoming the dominant memory consumer and a hard capacity constraint for concurrency, especially with contexts exceeding 128K tokens.
System Rigidity: Existing frameworks lack efficient support for architectural heterogeneity, including massive Mixture-of-Experts (MoE) models requiring complex routing and multimodal models combining vision encoders with language models.
Operational Fragility: Enterprise deployments require rapid model iteration (minute-level loading for 600B+ parameter models) and robust fault tolerance, which current systems often lack, leading to hours-long loading times and poor SLO adherence under fluctuating loads.
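A back-of-the-envelope calculation makes the memory-exhaustion point concrete. The formula is the standard size of an attention KV cache; the model shape used here (64 layers, 8 KV heads, head dimension 128, FP16) is a hypothetical example, not a configuration taken from the paper.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: one K and one V vector per layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 2 ** 30
CONTEXT = 128 * 1024  # a 128K-token context

# Hypothetical model shape (an assumption for illustration only).
one_request = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128, seq_len=CONTEXT)
four_requests = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128,
                               seq_len=CONTEXT, batch=4)

print(one_request / GIB)    # 32.0
print(four_requests / GIB)  # 128.0
```

Under these assumptions a single 128K-token request already needs 32 GiB for its cache alone, and the cost scales linearly with both context length and batch size.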

Methodology

RTP-LLM is a holistic, production-ready inference engine developed by Alibaba's Foundation Model Inference Team. It addresses the aforementioned challenges through an integrated system design featuring the following core components:

1. Optimized Model Loading

To address rapid iteration, RTP-LLM shifts from a model-structure-driven loading paradigm to a file-order-driven approach.

File-Order-Driven I/O: Instead of each tensor parallel process reading all files, processes iterate through model files sequentially, loading all tensors from a file before moving to the next. This maximizes FUSE prefetching efficiency.
Hybrid Distributed Reading: Utilizing a hybrid of fastsafetensors and PyTorch distributed broadcast, each file is read by a single process and shared, eliminating redundant I/O.
Shared Memory Reuse: A custom interface reuses a single shared memory buffer across file reads to avoid the high overhead of repeated pinned memory allocation.
I/O-Communication Overlap: File reading and tensor broadcasting are parallelized to overlap I/O and communication latency.
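The difference between the two loading paradigms can be illustrated with a small in-process simulation. This is a sketch under stated assumptions: the checkpoint contents, the `shard` helper, and the round-robin choice of reader are all hypothetical, and the inner loop over ranks merely stands in for a collective broadcast. It does not use fastsafetensors or PyTorch, and it does not model shared-memory reuse or I/O-communication overlap.

```python
from typing import Dict, List, Tuple

# Hypothetical checkpoint: file name -> {tensor name -> values}.
CHECKPOINT: Dict[str, Dict[str, List[float]]] = {
    "model-00001.safetensors": {"layer0.weight": [0.0, 1.0, 2.0, 3.0]},
    "model-00002.safetensors": {"layer1.weight": [4.0, 5.0, 6.0, 7.0]},
    "model-00003.safetensors": {"layer2.weight": [8.0, 9.0, 10.0, 11.0]},
}

Weights = List[Dict[str, List[float]]]


def shard(values: List[float], rank: int, world_size: int) -> List[float]:
    """Tensor-parallel slice of a tensor owned by `rank`."""
    step = len(values) // world_size
    return values[rank * step:(rank + 1) * step]


def load_structure_driven(world_size: int) -> Tuple[Weights, List[int]]:
    """Baseline: every rank opens every file to pick out its own slices."""
    reads = [0] * world_size
    weights: Weights = [dict() for _ in range(world_size)]
    for rank in range(world_size):
        for tensors in CHECKPOINT.values():
            reads[rank] += 1  # one full file read per rank per file
            for name, values in tensors.items():
                weights[rank][name] = shard(values, rank, world_size)
    return weights, reads


def load_file_order_driven(world_size: int) -> Tuple[Weights, List[int]]:
    """Each file is read once, in file order, then shared with all ranks."""
    reads = [0] * world_size
    weights: Weights = [dict() for _ in range(world_size)]
    for index, tensors in enumerate(CHECKPOINT.values()):
        reader = index % world_size  # assumed round-robin reader assignment
        reads[reader] += 1
        for rank in range(world_size):  # stand-in for a broadcast from `reader`
            for name, values in tensors.items():
                weights[rank][name] = shard(values, rank, world_size)
    return weights, reads


baseline_weights, baseline_reads = load_structure_driven(world_size=2)
shared_weights, shared_reads = load_file_order_driven(world_size=2)
print(sum(baseline_reads), sum(shared_reads))  # 6 3
```

Both loaders end with identical per-rank weights, but the file-order version touches each file once instead of once per rank.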

2. Prefill-Decode (PD) Disaggregation

RTP-LLM physically decouples the compute-intensive Prefill phase (processing input prompts) from the memory-bound Decode phase (generating tokens) onto dedicated nodes.

Independent Scaling: Prefill nodes are optimized for large-batch throughput, while Decode nodes are optimized for low-latency memory access.
Dynamic Traffic Scheduling: A central Master node performs intelligent load balancing, grouping requests by sequence length and utilizing predictive scheduling based on queue states and estimated completion times.
KV Cache Affinity: For decode requests, the system routes traffic to workers holding the relevant chat session's KV cache to maximize locality.
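The routing rules above can be sketched as two small selection functions. This is a simplified illustration, not the Master node's actual policy: the `Worker` fields, the throughput-based completion estimate, and the fallback to the least-loaded worker are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Worker:
    name: str
    tokens_per_second: float      # assumed measured throughput
    queued_tokens: int = 0        # work already waiting on this worker
    sessions: Set[str] = field(default_factory=set)  # sessions cached here

    def estimated_finish(self, new_tokens: int) -> float:
        """Seconds until a new request of `new_tokens` would complete."""
        return (self.queued_tokens + new_tokens) / self.tokens_per_second


def pick_prefill(workers: List[Worker], prompt_tokens: int) -> Worker:
    """Predictive scheduling: choose the earliest estimated completion time."""
    return min(workers, key=lambda w: w.estimated_finish(prompt_tokens))


def pick_decode(workers: List[Worker], session_id: str, new_tokens: int) -> Worker:
    """KV cache affinity first, then fall back to the least-loaded worker."""
    for worker in workers:
        if session_id in worker.sessions:
            return worker
    return min(workers, key=lambda w: w.estimated_finish(new_tokens))


prefill = [Worker("p0", 1000.0, queued_tokens=4000),
           Worker("p1", 500.0, queued_tokens=500)]
decode = [Worker("d0", 100.0, queued_tokens=900, sessions={"chat-42"}),
          Worker("d1", 100.0)]

print(pick_prefill(prefill, 1000).name)          # p1
print(pick_decode(decode, "chat-42", 200).name)  # d0
print(pick_decode(decode, "chat-7", 200).name)   # d1
```

Note that `p1` wins despite being the slower machine, because its queue is short: the estimate is 3.0 s against 5.0 s for `p0`. The known session stays on `d0` even though `d1` is idle, since moving it would forfeit the cached context.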

3. Hierarchical Multi-Tiered KV Cache Management

To mitigate memory constraints, RTP-LLM implements a four-tier cache hierarchy:

GPU Memory (BlockCache): Fastest access for active blocks.
Local CPU Memory: Serves as a spill-over for GPU memory.
Remote CPU Memory: Accessed via high-speed RDMA.
Distributed Storage (3FS): Persistent storage for long-term caching.
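A tiered lookup of this kind might look like the sketch below. The least-recently-used spill policy, the promote-on-hit behavior, and the tiny capacities are assumptions chosen for illustration; the paper summary only states that the four tiers exist and that CPU memory acts as a spill-over for the GPU.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class TieredCache:
    """Toy multi-tier block cache, fastest tier first."""

    def __init__(self, capacities: List[Tuple[str, int]]):
        # Each tier: (name, max_blocks, blocks in least-recently-used order).
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, key: str, block: str, level: int = 0) -> None:
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the block is dropped
        _, cap, store = self.tiers[level]
        store[key] = block
        store.move_to_end(key)
        if len(store) > cap:
            old_key, old_block = store.popitem(last=False)  # evict the LRU block
            self.put(old_key, old_block, level + 1)         # spill one tier down

    def get(self, key: str) -> Optional[Tuple[str, str]]:
        """Return (tier the block was found in, block), or None on a miss."""
        for name, _, store in self.tiers:
            if key in store:
                block = store.pop(key)
                self.put(key, block, 0)  # promote to the fastest tier
                return name, block
        return None


cache = TieredCache([("gpu", 2), ("cpu", 2), ("remote", 2), ("3fs", 4)])
for key in ("a", "b", "c"):
    cache.put(key, f"kv-{key}")

print(cache.get("a"))    # ('cpu', 'kv-a')
print(cache.get("a"))    # ('gpu', 'kv-a')
print(cache.get("zzz"))  # None
```

The first lookup of `a` finds it in CPU memory because inserting `c` pushed it out of the two-block GPU tier; the hit promotes it, so the second lookup is served from the GPU.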

Unified Hash-Based Prefix Matching: A unified hash map aggregates cache keys from all workers, so matching a prefix costs $O(B)$ lookups, where $B$ is the number of blocks in the request, rather than the $O(B \times W)$ needed to query each of $W$ workers separately. This allows efficient reuse of KV cache pages across requests sharing common prefixes (e.g., system prompts).
Sampled Prefix Hashing: To balance granularity and overhead, the system uses sampled hashing for large blocks, creating multiple hash entries at regular intervals.
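One way to picture unified prefix matching is a single dictionary from chained block hashes to the workers that hold each block. The sketch below is an assumption-laden illustration: the block size, the SHA-256 chaining scheme, and the `UnifiedIndex` class are hypothetical, and sampled prefix hashing is not implemented.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

BLOCK = 4  # tokens per cache block (hypothetical)


def block_hashes(tokens: List[int]) -> List[bytes]:
    """Hash each full block together with everything before it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


class UnifiedIndex:
    """One map for all workers: block hash -> workers holding that block."""

    def __init__(self) -> None:
        self.owners: Dict[bytes, Set[str]] = defaultdict(set)

    def register(self, worker: str, tokens: List[int]) -> None:
        for h in block_hashes(tokens):
            self.owners[h].add(worker)

    def longest_prefix(self, tokens: List[int]) -> Tuple[Optional[str], int]:
        """One dictionary lookup per block, independent of the worker count."""
        best: Set[str] = set()
        matched = 0
        for h in block_hashes(tokens):
            holders = self.owners.get(h)
            if not holders:
                break
            best, matched = holders, matched + 1
        return (sorted(best)[0] if best else None), matched


index = UnifiedIndex()
system_prompt = list(range(8))                      # two shared blocks
index.register("w0", system_prompt + [8, 9, 10, 11])
index.register("w1", system_prompt + [100, 101, 102, 103])

print(index.longest_prefix(system_prompt + [8, 9, 10, 11, 50, 51]))  # ('w0', 3)
print(index.longest_prefix(system_prompt + [9, 9, 9, 9]))            # ('w0', 2)
print(index.longest_prefix([7, 7, 7, 7]))                            # (None, 0)
```

Because each hash folds in the previous one, a hit on block `i` implies the whole prefix up to `i` matches, which is why a single pass over the request's blocks is enough.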

4. Advanced Inference Optimizations

Modular Speculative Decoding: Supports multiple algorithms (Medusa, Eagle, Prompt Lookup, MTP) via a C++-based framework. It decouples token proposal, scoring, and verification, enabling parallel verification of multiple future tokens.
Adaptive KV Cache Quantization: Implements on-the-fly quantization of KV caches (FP16/BF16 to INT8/INT4/FP8) to reduce memory footprint and bandwidth pressure, particularly for long-context workloads.
Decoupled Multimodal Processing: For vision-language models (e.g., LLaVA, Qwen-VL), the Vision Transformer (ViT) and LLM are deployed separately. This allows the ViT to process images while the LLM generates text, enabling computation overlap and reducing single-device memory footprints.
Multi-Level Parallelism: Integrates Tensor, Pipeline, Data, and Expert Parallelism to support dense and MoE models (up to 600B+ parameters).
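To illustrate the KV cache quantization item above, here is a minimal sketch of symmetric INT8 quantization with one scale per vector. This is a common textbook scheme shown as an assumption; the summary does not say which scaling granularity or rounding RTP-LLM uses.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


key_vector = [0.5, -1.27, 0.0, 1.0]   # hypothetical slice of a cached key
stored, scale = quantize_int8(key_vector)
restored = dequantize_int8(stored, scale)

worst_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print(stored)                    # [50, -127, 0, 100]
print(worst_error <= scale / 2 + 1e-9)  # True
```

Each stored value takes one byte instead of the two used by FP16 or BF16, so the cache shrinks to roughly half its size plus the per-vector scales, at the cost of a rounding error of at most half a quantization step.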

Key Contributions

System Design: The design and implementation of RTP-LLM, a cohesive platform integrating memory management, scheduling, and hardware acceleration, proven in production serving over 100 million users.
Engineering Practices: Novel strategies including hierarchical load balancing for disaggregated serving, unified multi-modal orchestration, and adaptive resource allocation based on live traffic analysis.
Comprehensive Evaluation: Extensive benchmarks across diverse architectures (dense, MoE, multimodal) using both controlled benchmarks and real production workloads, offering insights into the interaction of optimization techniques.
Open Source: The release of RTP-LLM as open-source software to foster community innovation.

Experimental Results

Evaluations were conducted on models ranging from 8B to 235B parameters, comparing RTP-LLM against vLLM and SGLang.

Model Loading: Achieved 4.7x–6.3x speedup in loading times for large models (e.g., 600B+ parameters) compared to baselines, enabling minute-level deployment.
Traffic Scheduling & Cache Reuse: In production traffic, RTP-LLM reduced TTFT P95 latency by 35–37% and improved cache reuse by 215%, allowing a 75% reduction in the number of prefill machines required.
Speculative Decoding: Delivered 1.12x–2.48x throughput improvements in speculative decoding scenarios.
Multimodal Inference: Achieved 1.86x–2.52x throughput improvement and 2.12x–2.36x TTFT reduction for multimodal models via decoupled ViT-LLM processing.
Quantized Inference: Reduced batch latency by 35–40% and improved TTFT by 1.9x–3.0x in quantized inference settings while maintaining competitive precision (PPL).
PD Disaggregation: For a 480B MoE model, RTP-LLM achieved a 4.72x–5.33x TTFT speedup and a 1.57x–2.36x improvement in cache hit rates compared to SGLang and vLLM.

Significance

The paper positions RTP-LLM as a comprehensive solution for industrial-scale LLM deployment, bridging the gap between theoretical model capabilities and practical system constraints. By addressing the full inference stack—from I/O bottlenecks in model loading to memory management in long contexts and scheduling in heterogeneous environments—RTP-LLM demonstrates that fundamental architectural rethinking, rather than incremental optimization, is necessary for production viability. Its open-source release aims to provide a foundational framework for future research and development in high-performance LLM inference.

RTP-LLM: High-Performance Alibaba LLM Inference Engine