xLLM Technical Report

Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, Yichen Zhang, Jinrun Yin, Keyang Zheng, Jiawei Yin, Jun Zhang, Ziyue Wang, Xiaobo Lin, Liangyu Liu, Liwei Lan, Yang Liu, Chunhua Peng, Han Liu, Songcheng Ren, Xuezhu Wang, Yunheng Shen, Yi Wang, Guyue Liu, Yitao Hu, Hui Chen, Tong Yang, Hailong Yang, Jing Li, Guiguang Ding, Ke Zhang

Published 2026-03-04

📖 5 min read🧠 Deep dive

View on arXiv ↗PDF ↗

Imagine you run a massive, high-speed restaurant called xLLM. This restaurant doesn't just serve food; it serves "thoughts" generated by giant AI brains (Large Language Models) to millions of hungry customers at once.

In the past, running this restaurant was a nightmare. The kitchen was chaotic, the waiters were confused, and the expensive ovens (AI accelerators) were often sitting idle while the chefs waited for orders.

Here is how xLLM fixes the restaurant, explained through simple analogies:

1. The Big Idea: Separating the "Front of House" from the "Kitchen"

Most AI systems try to do everything in one big room. xLLM splits the restaurant into two distinct teams:

xLLM-Service (The Front of House): This is the manager and the waiters. They decide who gets served, when, and where. They handle the chaos of the crowd.
xLLM-Engine (The Kitchen): This is the chefs and the ovens. They focus purely on cooking the food (processing the data) as fast and efficiently as possible.

By separating these two, the managers can rearrange the dining room without stopping the chefs from cooking.

2. Solving the "Tidal Wave" Problem (Online vs. Offline)

The Problem: Imagine your restaurant gets flooded with customers at lunch (Online requests: chatbots, customer service) but is empty at 3 AM. Meanwhile, you have a slow, non-urgent task like "cleaning the windows" (Offline requests: data analysis).

Old Way: You hire extra staff for lunch and fire them at night. The "window cleaning" staff sits idle all day because they can't help during the rush.
xLLM's Way: You have a Smart Scheduler.
- At lunch, the "window cleaners" help serve tables.
- If a VIP customer (a critical chatbot user) arrives, the window cleaner immediately stops cleaning and helps the VIP.
- When the rush dies down, the VIPs leave, and the window cleaners go back to their slow tasks.
- Result: No one is ever sitting around doing nothing. The kitchen is always full.

3. The "Split Kitchen" Strategy (PD Disaggregation)

The Problem: Cooking a meal has two steps: Prep (chopping veggies, reading the recipe) and Cooking (frying, baking).

In a traditional kitchen, one chef does both for one dish. If the prep takes 10 minutes, the stove sits idle. If the cooking takes 10 minutes, the chopping board sits idle.
xLLM's Way: They split the kitchen into two specialized zones:
- The Prep Zone: A team of chefs who only chop and read recipes.
- The Cooking Zone: A team of chefs who only fry and bake.
The Magic: If the Prep Zone is busy but the Cooking Zone is free, xLLM can instantly move a "Cooking" chef to help with "Prep" (and vice versa). It's like having a flexible workforce that can swap roles instantly without changing their aprons. This ensures the stove is never cold and the chopping board is never empty.

4. Handling "Multimedia" Orders (Images + Text)

The Problem: Sometimes a customer orders a complex dish that requires looking at a picture of the food and reading a text description.

Old Way: The chef looks at the picture, then reads the text, then cooks. It's slow.
xLLM's Way: They use Dual-Stream Parallelism.
- One chef looks at the picture (Image Encoder).
- Another chef reads the text (Text Encoder).
- They work at the same time, then meet up to cook. It's like having two assembly lines working in parallel instead of one.

5. The "Smart Fridge" (Memory Management)

The Problem: AI models need to remember everything they've said in a conversation (the "Context"). As conversations get longer, the fridge (memory) runs out of space.

Old Way: You either buy a giant fridge that is half-empty (wasting space) or you keep throwing away old food to make room for new food (losing context).
xLLM's Way: They invented xTensor.
- Imagine a fridge where the shelves are logically connected (you think of them as one long shelf) but physically scattered in different corners of the kitchen.
- The system only pulls out the exact shelf space needed for the current sentence. If a conversation ends, that space is instantly snapped back into the pool for the next customer. No wasted space, no lost food.

6. The "Super Chef" Tricks (Engine Optimizations)

Inside the kitchen, xLLM uses some magic tricks to cook faster:

The "Pre-Order" Trick (Speculative Decoding): Instead of waiting to cook one dish at a time, the chef guesses the next 5 dishes the customer might want and starts prepping them. If the guess is right, the food is ready instantly.
The "Assembly Line" (Pipeline): The kitchen doesn't wait for the oven to finish before starting the next step. While the oven is baking Dish A, the chef is chopping for Dish B, and the waiter is plating Dish C. Everything happens at the same time.
The "Traffic Cop" (Load Balancing): If one chef is overwhelmed with orders while another is standing still, the system instantly moves orders to the free chef. It prevents the "slowest chef" from holding up the whole line.

7. The Result: A Supercharged Restaurant

When xLLM was tested against other famous restaurant chains (like MindIE and vLLM):

It served 1.7 to 2.2 times more customers in the same amount of time.
It handled the "rush hour" without anyone getting angry (low latency).
It saved money by using the kitchen equipment much more efficiently.

In short: xLLM is like upgrading a chaotic, small-town diner into a high-tech, self-optimizing Michelin-star restaurant that never stops moving, never wastes space, and always serves the customer exactly what they need, exactly when they need it. And the best part? They shared the blueprints with the whole world (open source) so everyone can build better restaurants!

1. Problem Statement

The paper identifies critical bottlenecks in current Large Language Model (LLM) inference frameworks when deployed in enterprise-grade, large-scale scenarios. The challenges are categorized into two main areas:

Service-Level Challenges:
- Hybrid & Dynamic Workloads: Existing schedulers struggle to handle the "tidal" nature of online traffic (bursty, latency-sensitive) while efficiently utilizing idle resources for offline tasks.
- Static PD Disaggregation: Traditional Prefill-Decode (PD) disaggregation assumes static resource allocation, failing to adapt to dynamic request loads (fluctuating input/output lengths), leading to low accelerator utilization and Service Level Objective (SLO) violations.
- Multimodal Inefficiency: Lack of strategies to efficiently handle multimodal requests (image, voice, text), specifically regarding parallel processing of encoding phases and fine-grained resource allocation.
- Fault Tolerance: As clusters scale, ensuring fast fault detection and recovery without impacting high-priority online requests is difficult.
Engine-Level Challenges:
- Underutilized Hardware: Modern AI accelerators often suffer from "computational bubbles" due to rigid CPU-accelerator dependencies, communication overhead (All-to-All), and architectural mismatches between compute units (e.g., Tensor vs. Vector cores).
- Memory Management: Expanding context windows cause KV Cache memory to grow exponentially. Traditional contiguous allocation wastes memory, while paging mechanisms (like PagedAttention) introduce high overhead and complexity.
- Algorithmic Limitations: Static scheduling in Data Parallelism (DP) and Expert Parallelism (EP) leads to load imbalances and straggler effects, reducing overall throughput.

2. Methodology: xLLM Architecture

xLLM introduces a novel decoupled Service-Engine architecture designed to optimize both cluster management and hardware execution.

A. xLLM-Service (The Service Layer)

Focuses on intelligent scheduling, resource management, and fault tolerance.

Unified Elastic Scheduling: Implements a Online-Offline Co-location Policy. Online requests are prioritized with preemption capabilities, while offline tasks utilize idle resources. The system dynamically migrates offline tasks between Prefill and Decode pools based on traffic peaks to maximize utilization.
Dynamic PD Disaggregation: Replaces static PD ratios with a workload-adaptive policy. It monitors real-time metrics (TTFT, TPOT) and uses a stateless instance design to instantly switch instance roles (Prefill $\leftrightarrow$ Decode) without restarting, ensuring SLO compliance.
Hybrid EPD Disaggregation: For multimodal inputs, it introduces an Encode-Prefill-Decode (EPD) strategy. An "EPD Profiler" determines the optimal disaggregation strategy (e.g., E-P-D, EP-D, or ED-P) based on request characteristics, enabling parallel execution of visual encoding and language generation.
Global KV Cache Management: Utilizes a distributed, multi-level storage architecture (HBM-DRAM-SSD) inspired by Mooncake Store. It supports global KV cache routing and offloading to expand capacity and improve hit rates across the cluster.
Fast Fault Recovery: A specialized architecture for inference that avoids heavy checkpointing. It enables fast request migration and KV cache recomputation/migration to recover from node failures with minimal latency impact.

B. xLLM-Engine (The Engine Layer)

Focuses on full-stack optimization of the inference execution pipeline.

Multi-layer Pipeline Execution:
- Framework Layer: Asynchronous CPU-accelerator scheduling to hide CPU preparation latency behind accelerator computation (using placeholder tokens).
- Model Graph Layer: Dual-stream parallelism splits micro-batches to overlap computation (Attention/Expert) with communication (All-to-All), hiding communication latency.
- Operator Layer: Fine-grained overlapping of Matrix (Cube) and Vector units to maximize hardware utilization.
Adaptive Graph Mode: Solves the issue of variable input shapes (batch size, sequence length) in Graph Mode. It employs Partial Graph Mode and dimension parameterization to pre-compile graphs for dynamic shapes, drastically reducing kernel launch overhead while maintaining flexibility.
xTensor Memory Management: Introduces a "logically contiguous, physically discrete" KV Cache structure. It decouples virtual address spaces from physical pages, allowing on-demand mapping and reuse of physical pages. This resolves the conflict between the need for contiguous memory for efficient kernels and the need for dynamic allocation for variable lengths.
Algorithmic Optimizations:
- Optimized Speculative Decoding: Asynchronous decoding and MLA (Multi-Head Latent Attention) optimizations to reduce data movement.
- Dynamic Load Balancing: Implements Dynamic Expert Parallel (EPLB) and Hierarchical Data Parallel (DP) load balancing. It uses real-time statistics to rebalance expert weights and migrates workloads between DP groups to eliminate stragglers.
Generative Recommendation: Specific optimizations for beam search and host-device overlapping to accelerate recommendation tasks.

3. Key Contributions

Decoupled Architecture: A service-engine split that separates cluster orchestration from hardware execution, allowing independent optimization of both layers.
Adaptive Scheduling Policies:
- Online-Offline Co-location: Maximizes resource utilization without compromising online SLOs.
- Dynamic PD/EPD Disaggregation: Adapts to real-time traffic and multimodal requirements, replacing static resource partitioning.
Full-Stack Engine Optimizations:
- Multi-layer Pipeline: Eliminates computational bubbles via CPU-accelerator overlap and dual-stream communication/computation overlap.
- xTensor Memory: A novel memory management scheme that achieves high memory efficiency and computational performance simultaneously.
- Adaptive Graph Mode: Enables efficient graph execution for dynamic input shapes.
Algorithmic Enhancements: Advanced load balancing for MoE and DP architectures and optimized speculative decoding.

4. Experimental Results

The framework was evaluated on Ascend 910B/910C accelerators using Qwen-series and Deepseek-series models, comparing against MindIE and vLLM-Ascend.

Throughput Improvements:
- Qwen-series: xLLM achieves up to 1.7× higher throughput than MindIE and 2.2× higher than vLLM-Ascend under identical TPOT constraints.
- Deepseek-series: xLLM achieves an average 1.7× throughput improvement over MindIE.
- PD Disaggregation: In DeepSeek-R1 tests, xLLM achieved 34% higher throughput (11,351 vs. 8,476 tokens/s) compared to MindIE.
Business Scenarios:
- JingYan (Shopping Assistant): xLLM showed superior scaling efficiency, delivering ~1.6× the throughput of vLLM-Ascend on Qwen3-8B.
- Customer Service: xLLM on Ascend 910C achieved 3.1× the throughput of vLLM-Ascend for Qwen3-32B.
- Generative Recommendation: Reduced end-to-end latency by 23% compared to MindIE under heavy loads (Beam Width 128).
Ablation Studies:
- Dynamic PD Policy: Improved request serving rates by 1.67× over minimal load strategies on bursty traffic.
- Multi-layer Pipeline: Reduced TPOT and increased throughput by up to 17.4% for smaller models by masking scheduling latency.
- Adaptive Graph Mode: Increased throughput by 27.4% for Qwen3-1.7B.

5. Significance

Enterprise Readiness: xLLM addresses the specific, complex needs of large-scale enterprise deployment (JD.com), including high availability, hybrid workloads, and multimodal support, which open-source frameworks often lack.
Hardware Agnosticism & Optimization: By deeply optimizing for domestic AI accelerators (Ascend) while maintaining a flexible architecture, it provides a blueprint for efficient inference on heterogeneous hardware.
Open Source Impact: The release of xLLM and xLLM-Service as open-source projects aims to democratize high-performance LLM inference, fostering innovation in the AI infrastructure ecosystem.
Scalability: The framework demonstrates near-linear scalability across different model sizes and hardware configurations, making it a viable solution for next-generation, trillion-parameter model serving.

In summary, xLLM represents a significant leap forward in LLM inference infrastructure by moving from static, monolithic designs to a dynamic, decoupled, and algorithmically aware system that maximizes hardware utilization and ensures strict SLO compliance in production environments.