RTP-LLM: High-Performance Alibaba LLM Inference Engine

RTP-LLM is a high-performance, open-source inference engine deployed at Alibaba Group. It achieves higher throughput and lower latency than vLLM and SGLang through integrated optimizations such as Prefill-Decode Disaggregation, hierarchical KV cache management, and modular speculative decoding.

Published 2026-05-29

Original authors: Boyu Tan, Jiarui Guo, Zongwei Lv, Hanbo Sun, Tong Yang, Kan Liu, Xinfei Shi, Zetao Hu, Yaxin Yu, Chi Zhang, Jianning Zhang, Xi Yang, Wei Zhang, Bo Cai, Silu Zhou, Xiyu Wang, Na He, Yinghao Yu, Wending Bao, Guiyang Huang, Yuxing Yuan, Juncheng Yin, Nan Wang, Lin Yang, Zechao Zhang, Lu Chen, Guoding Li, Tao Lan, Lin Qu

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you run a massive, super-fast library where people come in to ask complex questions. The "Librarian" is a giant AI brain (the Large Language Model) that knows everything. But there's a problem: the library is so big, and the questions are so varied, that the old way of running the library is slow, expensive, and often leaves the Librarian waiting around doing nothing while the shelves are being organized.

The paper introduces RTP-LLM, a brand new, high-speed operating system for this library, built by Alibaba. It's designed to handle millions of customers at once without breaking a sweat. Here is how it works, using simple analogies:

1. The "Prefill" vs. "Decode" Split (The Assembly Line)

In a traditional library, the Librarian has to do two very different jobs at the same time:

  • Job A (Prefill): Reading a huge, complex question all at once to understand it. This is like a heavy lifting job (compute-heavy).
  • Job B (Decode): Answering the question one word at a time. This is like a delicate, memory-heavy job where you have to remember everything you just said.

The Problem: Doing both on the same desk causes traffic jams. The heavy lifting slows down the delicate word-by-word answering.
The RTP-LLM Solution: They built a two-station assembly line.

  • Station 1 (Prefill Node): A dedicated team of strong workers who only read the questions and get them ready.
  • Station 2 (Decode Node): A dedicated team of fast typists who only write the answers word-by-word.
By separating them, Station 1 can process huge batches of questions quickly, while Station 2 can focus entirely on speed. They don't get in each other's way.
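The two-station idea can be sketched in a few lines of Python. This is only a toy illustration, not RTP-LLM's code: the names `Request`, `prefill_worker`, `decode_worker`, and `serve` are made up for this example, the "KV cache" is just a list of tokens, and the "model" emits placeholder tokens. In a real deployment the two stations would run on separate machines and the cache would travel over the network.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import List


@dataclass
class Request:
    prompt: List[str]                                   # the tokenized question
    max_new_tokens: int
    kv_cache: List[str] = field(default_factory=list)   # toy stand-in for KV tensors
    output: List[str] = field(default_factory=list)


def prefill_worker(req: Request) -> Request:
    """Station 1: read the whole prompt in one pass and build the cache."""
    req.kv_cache = list(req.prompt)      # one toy cache entry per prompt token
    return req


def decode_worker(req: Request) -> Request:
    """Station 2: write one token per step, extending the cache each time."""
    for step in range(req.max_new_tokens):
        token = f"tok{step}"             # placeholder for a real model's next token
        req.output.append(token)
        req.kv_cache.append(token)
    return req


def serve(requests: List[Request]) -> List[Request]:
    handoff: Queue = Queue()             # stand-in for the prefill -> decode transfer
    for req in requests:
        handoff.put(prefill_worker(req))
    done = []
    while not handoff.empty():
        done.append(decode_worker(handoff.get()))
    return done


results = serve([Request(["why", "is", "the", "sky", "blue"], max_new_tokens=3)])
print(results[0].output)
```

The point of the handoff queue is that each station only ever does one kind of work, so each can be sized and batched for its own bottleneck.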

2. The "Smart Filing Cabinet" (KV Cache Management)

When the Librarian answers a question, they have to remember every word they've said so far to keep the conversation making sense. This memory is called the KV Cache.

  • The Problem: As conversations get longer, this memory pile grows so big it fills up the entire room, forcing the Librarian to throw things away or slow down to find them.
  • The RTP-LLM Solution: They built a multi-layered, smart filing system.
    • Layer 1 (GPU Memory): The most important, frequently used notes are on the Librarian's immediate desk (super fast).
    • Layer 2 (Local CPU Memory): Less urgent notes are in a drawer right next to the desk.
    • Layer 3 (Remote Memory): Older notes are in a warehouse across the hall, reachable by a high-speed tube (RDMA).
    • Layer 4 (Cloud Storage): The deepest archives are in a massive cloud warehouse.
    • The Magic Trick: If two people ask similar questions (e.g., "Write a story about a cat"), the system realizes, "Hey, we already have the first half of that story in the filing cabinet!" It reuses those notes instead of rewriting them from scratch. This saves a massive amount of time and space.
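As a rough sketch of the filing cabinet, the toy class below keeps one dictionary per layer and looks for the longest already-stored prefix of a new question, checking the fastest layer first. Everything here is an assumption made for illustration: the class name `TieredPrefixCache`, the tier labels, and the rule of promoting a hit to the fastest tier are not taken from the paper, and real engines match fixed-size blocks of tokens rather than scanning every prefix length.

```python
TIERS = ["gpu", "cpu", "remote", "cloud"]   # fastest to slowest


class TieredPrefixCache:
    def __init__(self):
        # Each tier maps a token-prefix tuple to its notes (toy stand-in for KV blocks).
        self.tiers = {name: {} for name in TIERS}

    def put(self, tier, tokens):
        self.tiers[tier][tuple(tokens)] = f"kv[{len(tokens)}]"

    def longest_prefix(self, tokens):
        """Return (tier, matched_length) for the longest cached prefix, or (None, 0)."""
        for length in range(len(tokens), 0, -1):
            key = tuple(tokens[:length])
            for name in TIERS:               # prefer the fastest tier holding it
                if key in self.tiers[name]:
                    return name, length
        return None, 0

    def lookup(self, tokens):
        tier, matched = self.longest_prefix(tokens)
        if tier is not None and tier != "gpu":
            # Assumed policy: copy the hit to the fastest tier for next time.
            key = tuple(tokens[:matched])
            self.tiers["gpu"][key] = self.tiers[tier][key]
        # The last number is how many tokens still need fresh prefill work.
        return tier, matched, len(tokens) - matched


cache = TieredPrefixCache()
cache.put("cpu", ["write", "a", "story", "about"])
first = cache.lookup(["write", "a", "story", "about", "a", "cat"])
second = cache.lookup(["write", "a", "story", "about", "a", "dog"])
miss = cache.lookup(["hello"])
print(first, second, miss)
```

In this run the cat request finds four of its six tokens in the CPU drawer, and the dog request then finds the same four tokens on the desk, so only two tokens per request need to be read from scratch.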

3. The "Speed Reading" Shortcut (Speculative Decoding)

Usually, the Librarian writes one word, checks it, writes the next, checks it, and so on. This is slow.
The RTP-LLM Solution: They use a Speed Reader Assistant.

  • The Assistant guesses the next 3 or 4 words the Librarian might say.
  • The Librarian quickly checks all those guesses at once.
  • If the guesses are right, the Librarian accepts all of them instantly. If one is wrong, they keep the words before it, correct that word, and throw away the guesses after it.
This turns a slow, step-by-step process into a fast, batched process, making the library much faster.
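Here is a minimal sketch of one guess-and-check round, using greedy matching. It is not RTP-LLM's implementation: `speculative_step`, `target_next`, and `draft_next` are hypothetical names, both "models" are simple lookup functions, and the checking loop runs one position at a time where a real engine would score all guesses in a single batched pass.

```python
SENTENCE = "the cat sat on the mat".split()


def target_next(ctx):
    """Toy Librarian: always knows the right next word."""
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"


def draft_next(ctx):
    """Toy Assistant: same as the Librarian, but gets one word wrong."""
    tok = target_next(ctx)
    return "in" if tok == "on" else tok


def speculative_step(target, draft, context, k=4):
    """One draft-then-verify round. Returns the tokens accepted this round."""
    # 1. The cheap draft model guesses k tokens, one after another.
    guesses, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        guesses.append(tok)
        ctx.append(tok)

    # 2. The target model checks each guess (sequential stand-in for one batched pass).
    accepted, ctx = [], list(context)
    for guess in guesses:
        truth = target(ctx)
        if guess != truth:
            accepted.append(truth)       # fix the first wrong guess, drop the rest
            return accepted
        accepted.append(guess)
        ctx.append(guess)
    accepted.append(target(ctx))         # every guess was right: one bonus token
    return accepted


round_one = speculative_step(target_next, draft_next, [], k=4)
round_two = speculative_step(target_next, draft_next, round_one, k=2)
print(round_one, round_two)
```

Even when a guess is wrong, the round still produces at least one correct token, so the output matches what the Librarian would have written alone.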

4. The "Express Delivery" for New Books (Model Loading)

When the library gets a new, massive encyclopedia (a new AI model with hundreds of billions of pages), loading it onto the shelves used to take hours.
The RTP-LLM Solution: They changed how the books are unpacked.

  • Old Way: Every worker tried to read the whole book to find their specific pages, causing a traffic jam at the door.
  • New Way: The books are organized by the order they arrive on the truck. One worker grabs a whole box, reads it, and passes the pages down a human chain to everyone else simultaneously.
This cuts the time to load a giant model from hours down to just minutes, allowing the library to swap out its knowledge base quickly.
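The difference between the two unpacking styles can be shown by counting how many times files are read from storage. The simulation below is a simplified assumption, not the paper's loader: `load_naive`, `load_read_once`, and `slice_for` are invented names, the "files" are small dictionaries of lists, and the inner loop over workers stands in for a scatter over a fast interconnect.

```python
def slice_for(values, rank, num_ranks):
    """The equal-sized piece of a tensor that one worker keeps."""
    chunk = len(values) // num_ranks
    return values[rank * chunk:(rank + 1) * chunk]


def load_naive(files, num_ranks):
    """Old way: every worker reads every file to find its own pages."""
    reads = 0
    shards = {r: {} for r in range(num_ranks)}
    for rank in range(num_ranks):
        for tensors in files.values():
            reads += 1                   # one full file read from storage
            for name, values in tensors.items():
                shards[rank][name] = slice_for(values, rank, num_ranks)
    return shards, reads


def load_read_once(files, num_ranks):
    """New way: each file is read once, then its pieces are handed out."""
    reads = 0
    shards = {r: {} for r in range(num_ranks)}
    for tensors in files.values():
        reads += 1                       # exactly one worker reads this file
        for name, values in tensors.items():
            for rank in range(num_ranks):    # stand-in for a scatter to all workers
                shards[rank][name] = slice_for(values, rank, num_ranks)
    return shards, reads


files = {
    "part0": {"w1": list(range(8))},
    "part1": {"w2": list(range(8, 16))},
}
naive_shards, naive_reads = load_naive(files, num_ranks=4)
once_shards, once_reads = load_read_once(files, num_ranks=4)
print(naive_reads, once_reads)
```

Both functions leave every worker holding the same pages, but the read-once version touches storage once per file instead of once per file per worker.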

5. The "Traffic Cop" (Scheduling)

With millions of people asking questions, some are short ("What's the weather?") and some are huge ("Analyze this 100-page legal document").
The RTP-LLM Solution: A smart Traffic Cop directs the flow.

  • If a short question comes in, it gets sent to a fast lane.
  • If a long question comes in, it gets grouped with similar long questions so the workers can handle them efficiently together.
  • The cop constantly checks who is busy and who is free, ensuring no worker sits idle while others are overwhelmed.
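A toy version of the Traffic Cop might look like the sketch below. The `Scheduler` class, the `SHORT_LIMIT` threshold of 512 tokens, and the choice to group long requests by power-of-two length are all assumptions made for this example; the paper's actual scheduling policy is more involved.

```python
from collections import defaultdict

SHORT_LIMIT = 512   # hypothetical cutoff, in tokens, for the fast lane


class Scheduler:
    def __init__(self, workers):
        self.load = {w: 0 for w in workers}      # outstanding tokens per worker
        self.long_batches = defaultdict(list)    # bucket -> waiting long requests

    def least_loaded(self):
        return min(self.load, key=self.load.get)

    def submit(self, req_id, num_tokens):
        if num_tokens <= SHORT_LIMIT:
            # Short question: send it straight to whoever is least busy.
            worker = self.least_loaded()
            self.load[worker] += num_tokens
            return ("fast_lane", worker)
        # Long question: group it with others of a similar size.
        bucket = num_tokens.bit_length()         # same power-of-two range
        self.long_batches[bucket].append((req_id, num_tokens))
        return ("batched", bucket)

    def flush(self, bucket):
        """Hand a whole group of similar long questions to the freest worker."""
        batch = self.long_batches.pop(bucket, [])
        worker = self.least_loaded()
        self.load[worker] += sum(n for _, n in batch)
        return worker, [r for r, _ in batch]


cop = Scheduler(["w0", "w1"])
a = cop.submit("a", 100)
b = cop.submit("b", 50)
c = cop.submit("c", 5000)
d = cop.submit("d", 6000)
assigned = cop.flush(13)
print(a, b, c, d, assigned)
```

Here the two short questions land on different workers, and the two long questions wait in the same bucket until they are dispatched together to the worker with the lighter load.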

The Results

The paper tested this system against other popular library systems (vLLM and SGLang) using real-world data from Alibaba's own apps (like Taobao and Tmall).

  • Loading Speed: It loaded new models 4.7 to 6.3 times faster.
  • Waiting Time: Customers got their first answer 35-40% faster.
  • Memory: It reused old notes 215% more effectively, meaning they needed fewer computers to do the same job.
  • Throughput: It could handle 1.8 to 2.5 times more complex tasks (like looking at images and writing text) at once.

In short, RTP-LLM is a complete overhaul of how AI libraries are run, turning a chaotic, slow process into a streamlined, high-speed factory that can serve over 100 million users efficiently.
