SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Imagine you run a massive, high-tech bakery that specializes in baking dozens or even hundreds of different types of custom cakes. In the world of Artificial Intelligence, these "cakes" are Large Language Models (LLMs) designed for specific jobs: one is a math genius, another is a coding wizard, and a third is a legal expert.

The Problem: The "Specialized Kitchen" Bottleneck

Currently, most bakeries operate like this:

The Prep Station (Prefill): This is where the chef reads the customer's order (the prompt) and gathers all the ingredients. It's fast and requires a lot of chopping (computing power).
The Baking Station (Decode): This is where the cake is actually baked, one layer at a time. It's slow, requires constant attention, and uses a lot of oven space (memory).

In the old way of doing things (called Disaggregated Serving), the bakery separates these stations to avoid chaos. But here's the catch: Every single cake type gets its own dedicated Baking Station.

If you have 100 orders for the "Math Cake" and only 2 orders for the "Legal Cake," the Math Baking Station is screaming for help, while the Legal Baking Station sits empty, doing nothing. The oven is underutilized, the electricity bill is huge, and customers waiting for the Math Cake are stuck in a long line. This is the Inter-Model Isolation problem.

The Solution: SUN (Shared Use of Next-token Prediction)

The researchers at NAVER Cloud proposed a brilliant new way to run the bakery called SUN.

Think of SUN as realizing that baking a cake is actually the same process, no matter what flavor it is. Whether you are baking a chocolate cake or a vanilla cake, the oven, the temperature, and the timing are identical. The only difference is the ingredients you put in at the start.

SUN splits the process into two distinct steps:

The Custom Prep (Prefill Module): This part is unique for every cake. The Math Cake needs a special math-flavored batter; the Legal Cake needs a law-flavored batter. In SUN, we keep these prep stations separate and customized.
The Universal Oven (Decode Module): This is the magic. SUN freezes the "baking instructions" into a single, universal module. Once the custom batter is ready, every single cake goes into the same shared oven.

How It Works in Real Life

Imagine a busy kitchen with 4 ovens.

Before (Old Way): You assign one oven to Math, one to Coding, one to Law, and one to Writing. If the Math orders spike, that oven is overwhelmed. The Law oven sits cold.
After (SUN): You have 4 ovens, but they are all shared.
- The Math Prep station finishes its batter and shouts, "Ready for baking!"
- The Coding Prep station finishes its batter and shouts, "Ready!"
- The Law Prep station finishes its batter and shouts, "Ready!"
- The Scheduler: Instead of sending the Math batter to the "Math Oven," it sends it to the first available oven. The Law batter goes to the next available oven.

This means you can shut down 2 of your ovens and still bake everything almost as fast, because the remaining ovens stay busy instead of sitting idle, taking Math and Law cakes side by side in the same batch.

The Secret Sauce: "Prefill-Only Tuning"

You might ask: "If I use the same oven for everything, won't the Math cake taste like a Law cake?"

That was the big fear. If you just mix the ingredients, the result is garbage. The researchers solved this with a clever trick called Prefill-Only Tuning.

They trained the Prep Stations (the custom batters) to be perfectly compatible with the Universal Oven. They didn't change the oven; they just taught the prep chefs how to mix their ingredients so that the universal oven knows exactly how to bake them.

Result: The Math cake tastes exactly like a Math cake, and the Law cake tastes like a Law cake, but they all come out of the same oven.

The Bonus: QSUN (The Energy-Saving Mode)

The researchers also realized that the Universal Oven is the most expensive part to run. So, they created QSUN.

Think of this as putting the oven on "Eco Mode." They lowered the precision of the oven's temperature sensors (quantization). Usually, this makes the cake taste bad. But because they re-trained the Prep Stations (the batters) to work perfectly with this "Eco Oven," the cakes still taste amazing!

Benefit: Each baking step takes about 45% less time and uses less energy, with almost no loss in quality.

Why This Matters

Cheaper: You need fewer GPUs (ovens) to serve the same number of people.
Faster: No more waiting in line for a specific oven. Ovens are no longer labeled "Math" or "Law," so your request goes to whichever shared oven has room.
Smarter: It handles "skewed" workloads gracefully. If everyone suddenly wants Math cakes, every oven in the shared pool can take Math batter, rather than one oven being swamped while the others sit idle.

In summary: SUN stops us from building a separate kitchen for every single type of AI model. Instead, it builds one super-efficient, shared kitchen where the "baking" part is universal, and only the "preparation" is custom. This saves money, saves energy, and gets your AI answers faster.

1. Problem Statement

The paper addresses the inefficiencies inherent in multi-LLM disaggregated serving, specifically focusing on inter-model isolation.

Context: Modern deployments often serve dozens or hundreds of specialized LLMs (e.g., for math, coding, or agentic workflows) concurrently. To mitigate intra-model interference (between compute-bound prefill and memory-bound decode phases), systems use disaggregated serving, separating prefill and decode onto different GPUs.
The Core Issue: While disaggregation solves intra-model interference, it creates inter-model isolation. In conventional setups, decode workers are partitioned per model.
- Skewed Workloads: Real-world traffic is highly skewed (e.g., a few popular models receive most requests, while others are idle).
- Resource Fragmentation: Dedicated decode GPUs for low-traffic models remain underutilized (often <20% utilization for memory-bound tasks), while high-traffic models face backlogs.
- Inability to Batch: Cross-model batching is impossible because each model has unique parameters, preventing the consolidation of decode requests onto a shared pool of GPUs.
Goal: The authors aim to enable cross-model sharing of decode execution to maximize GPU utilization and reduce Total Cost of Ownership (TCO) without sacrificing model accuracy.

2. Methodology: SUN (Shared Use of Next-token Prediction)

SUN proposes a novel architecture that decomposes a decoder-only Transformer into two distinct modules and introduces a specific training strategy to enable sharing.

A. Model Decomposition

The authors formally decompose the LLM inference process:

Prefill Module ( $P_{\theta_p}$ ): Processes the input prompt in parallel to generate the initial Key-Value (KV) cache ( $C_X$ ) and the first token distribution. This phase is compute-bound.
Decode Module ( $D_{\theta_d}$ ): Generates subsequent tokens autoregressively using the KV cache. This phase is memory-bound.
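
Written out, the two modules compose as shown below. The notation beyond $P_{\theta_p}$, $D_{\theta_d}$, and $C_X$ is an assumption made for illustration and is not taken from the paper: $X$ is the prompt, $y_t$ is the $t$-th generated token, and $C_t$ is the KV cache after $y_t$ has been appended.

```latex
\begin{aligned}
\bigl(C_X,\; p(y_1 \mid X)\bigr) &= P_{\theta_p}(X), \\
\bigl(C_t,\; p(y_{t+1} \mid X, y_{\le t})\bigr) &= D_{\theta_d}(y_t,\, C_{t-1}), \qquad C_0 = C_X,\quad t \ge 1.
\end{aligned}
```

In this form, the only thing the decode module ever receives from the prefill module is $C_X$, which is why a task-specific prefill can, in principle, be paired with a shared decode module.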

B. Prefill-Only Tuning

To enable a single frozen decode module to serve multiple task-specific models, SUN employs Prefill-Only Tuning:

Training: For each downstream task, only the prefill parameters ( $\theta_p$ ) are fine-tuned. The decode parameters ( $\theta_d$ ) are kept frozen and shared across all tasks.
Mechanism: The task-specific prefill module learns to produce KV caches ( $C_X$ ) that are specifically adapted to the frozen shared decoder.
Result: This aligns the training distribution with the inference distribution, ensuring that the shared decoder can correctly interpret the KV caches generated by different prefill modules. This eliminates the "train-inference mismatch" that causes accuracy drops in naive sharing attempts.
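
The idea of training only the prefill side against a frozen decoder can be illustrated with a deliberately tiny toy, sketched here under strong assumptions: each "module" is a single scalar weight, the "KV cache" is one number, and the loss is squared error. None of these names or values come from the paper; a real setup would freeze the decode-path parameters of a Transformer instead.

```python
def prefill_only_tune(theta_p, theta_d, data, lr=0.05, steps=200):
    """Fit theta_p by gradient descent on squared error; theta_d is never updated."""
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            cache = theta_p * x      # task-specific "prefill" output (stand-in for C_X)
            pred = theta_d * cache   # frozen, shared "decode" step
            # d/d(theta_p) of (pred - y)^2, with theta_d treated as a constant
            grad += 2.0 * (pred - y) * theta_d * x
        theta_p -= lr * grad / len(data)
    return theta_p


THETA_D = 2.0  # hypothetical shared decode weight, identical for every task

# Two hypothetical "tasks" with different target mappings.
math_data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5)]
code_data = [(x, -1.0 * x) for x in (0.5, 1.0, 1.5)]

theta_p_math = prefill_only_tune(1.0, THETA_D, math_data)
theta_p_code = prefill_only_tune(1.0, THETA_D, code_data)
```

Each task ends up with its own prefill weight (close to 1.5 and -0.5 here) while both reuse the same `THETA_D`, which mirrors the structure described above: the cache is shaped to suit the decoder, not the other way around.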

C. Model-Agnostic Routing

SUN implements a two-stage routing policy:

Prefill Routing: Requests are routed deterministically to the specific task-tuned prefill worker (e.g., math requests go to the math prefill module).
Decode Routing: Once the KV cache is generated, the decode request is dispatched to a shared pool of decode workers in a model-agnostic manner. The system load-balances requests across the shared pool regardless of the originating model, effectively merging requests from different models into a single batch.
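
A minimal sketch of the two-stage policy follows. The class names, worker names, and the least-loaded selection rule are assumptions for illustration; the text above only states that decode requests are load-balanced across a shared pool without regard to the originating model. Requests never finish in this sketch, so load is simply the number of requests assigned so far.

```python
from dataclasses import dataclass, field


@dataclass
class DecodeWorker:
    """One GPU in the shared decode pool (hypothetical structure)."""
    name: str
    active: list = field(default_factory=list)  # (request_id, model) pairs being decoded


class TwoStageRouter:
    def __init__(self, prefill_workers, decode_pool):
        self.prefill_workers = prefill_workers  # model name -> its task-tuned prefill worker
        self.decode_pool = decode_pool          # shared by every model

    def route_prefill(self, model):
        # Stage 1: deterministic lookup keyed on model identity.
        return self.prefill_workers[model]

    def route_decode(self, request_id, model):
        # Stage 2: model identity is ignored when choosing a worker.
        # Least-loaded selection is an assumed policy for this sketch.
        worker = min(self.decode_pool, key=lambda w: len(w.active))
        worker.active.append((request_id, model))
        return worker.name


pool = [DecodeWorker("decode-0"), DecodeWorker("decode-1")]
router = TwoStageRouter({"math": "prefill-math", "law": "prefill-law"}, pool)

# Skewed traffic: five math requests and a single law request.
for request_id, model in enumerate(["math"] * 5 + ["law"]):
    router.route_prefill(model)
    router.route_decode(request_id, model)

loads = [len(w.active) for w in pool]
```

With per-model partitioning the same traffic would put five requests on one worker and one on the other; here both workers end up with three, and one of them holds math and law requests in the same batch.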

D. Quantized SUN (QSUN)

To further optimize the memory-bound decode phase, the authors introduce QSUN:

Quantization: The shared decode module is quantized (e.g., to 4-bit weights) to reduce memory traffic and improve throughput.
Re-tuning: Naive quantization causes accuracy degradation due to representation mismatch. QSUN performs prefill-only re-tuning after quantization. The prefill module is updated to generate KV caches compatible with the low-bit decoder, recovering accuracy while retaining the speed benefits of quantization.
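
To make "4-bit weights" concrete, here is a sketch of plain symmetric round-to-nearest quantization of one small weight group. This is an assumed, simplified scheme chosen for illustration; it is not AWQ, which the evaluation refers to, and the example weights are made up.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization of one weight group to integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # one shared scale for the group
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map 4-bit integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.70, -0.33, 0.12, -0.04]  # hypothetical decode weights
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The restored weights differ slightly from the originals, so the decoder's internal representations shift. That shift is the representation mismatch that prefill-only re-tuning is meant to absorb.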

3. Key Contributions

Robust Cross-Model Decode Sharing: SUN is the first approach to eliminate inter-model isolation in disaggregated serving by sharing a single frozen decode module across diverse models via prefill-only tuning.
Model-Agnostic Routing for High Utilization: By decoupling decode routing from model identity, SUN enables effective load balancing. Experiments show it maintains system throughput with 50% fewer decode GPUs compared to conventional per-model partitioning.
Accuracy-Preserving Quantized Decoding: QSUN combines weight-only quantization with prefill re-tuning, achieving a 45% reduction in Time-Per-Output-Token (TPOT) while recovering near full-precision accuracy.
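
As a side calculation, the reported 45% TPOT figure can be converted into a per-request token rate. This assumes the reduction applies uniformly to every output token; the subscript "base" denotes the Full-FT baseline.

```latex
\mathrm{TPOT}_{\mathrm{QSUN}} = (1 - 0.45)\,\mathrm{TPOT}_{\mathrm{base}} = 0.55\,\mathrm{TPOT}_{\mathrm{base}}
\quad\Longrightarrow\quad
\frac{\text{tokens/s}_{\mathrm{QSUN}}}{\text{tokens/s}_{\mathrm{base}}} = \frac{1}{0.55} \approx 1.82
```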

4. Experimental Results

The authors evaluated SUN on LLaMA3.1-8B and Qwen3 models (1.7B/8B/14B) across math, coding, and tool-calling tasks using a vLLM-based disaggregated setup on 8x NVIDIA A100 GPUs.

Accuracy:
- SUN achieves accuracy comparable to full fine-tuning (Full-FT) across all tasks. For example, on LLaMA3.1-8B, SUN came within 0.7 points of Full-FT on GSM8K (70.6% vs 71.3%) and slightly exceeded it on HumanEval (49.4% vs 48.2%).
- QSUN recovers accuracy lost by quantization. On LLaMA3.1-8B, naive quantization (SUN+AWQ-d) dropped GSM8K accuracy to 11.4%, but QSUN restored it to 70.3%, nearly matching the 16-bit baseline.
System Throughput & Efficiency:
- Throughput per GPU: SUN improves throughput per GPU by up to 2.0× over conventional disaggregated baselines under skewed workloads.
- Latency: SUN keeps Time-Per-Output-Token (TPOT) degradation within 5% while consolidating decode resources.
- QSUN Performance: QSUN reduces TPOT by 45% compared to Full-FT and outperforms standard AWQ quantization in both accuracy and Time-To-First-Token (TTFT) because the prefill phase remains in full precision.
Robustness to Skew:
- Under highly skewed workloads (Zipf distribution $\alpha=3.0$ ), conventional baselines suffer from severe interactivity degradation and throughput drops.
- SUN maintains stable throughput and interactivity, improving system throughput by 29.2% and interactivity by 44.1% compared to the conventional baseline under the same skew.
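
To give a feel for how skewed a Zipf distribution with $\alpha=3.0$ is, the sketch below computes per-model request shares. The choice of four models and the "perfectly balanced shared pool" comparison are assumptions for illustration, not the paper's experimental configuration.

```python
def zipf_shares(num_models, alpha):
    """Fraction of requests per model when popularity at rank r is proportional to r**(-alpha)."""
    weights = [rank ** -alpha for rank in range(1, num_models + 1)]
    total = sum(weights)
    return [w / total for w in weights]


shares = zipf_shares(num_models=4, alpha=3.0)

# Per-model partitioning: the hottest decode GPU carries the top model's whole share.
partitioned_peak = max(shares)
# Idealized shared pool: traffic is spread evenly over the same four GPUs.
shared_peak = 1.0 / len(shares)
peak_ratio = partitioned_peak / shared_peak
```

Under these assumptions the most popular model receives roughly 85% of all requests, so its dedicated decode GPU carries more than three times the load it would see in an evenly balanced shared pool, while the least popular model's GPU handles under 2% of traffic.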

5. Significance

Economic Impact: By enabling cross-model batching and resource consolidation, SUN significantly reduces the number of GPUs required for multi-LLM serving, lowering the Total Cost of Ownership (TCO) and energy consumption.
Scalability: It solves a critical bottleneck in the emerging "Agentic AI" paradigm, where a single user query may trigger a pipeline of multiple specialized models. SUN allows these diverse models to share infrastructure efficiently.
Technical Innovation: The concept of Prefill-Only Tuning offers a new paradigm for model adaptation, separating the "context understanding" (prefill) from the "generation logic" (decode), allowing for flexible, shared inference backends.
Practicality: The integration with existing frameworks (vLLM) and the demonstration of quantization benefits make SUN a viable, deployable solution for production environments.