Original authors: Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

Published 2026-06-12


CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are running a massive, high-tech kitchen.

The Old Way (Existing Systems):
For a long time, AI models were like simple, single-lane assembly lines. If you wanted to bake a cake (generate text), every ingredient went down the same path: mix, bake, frost, serve. Systems like vLLM were built specifically for this "text-only" assembly line. They were incredibly fast at baking cakes.

But now, AI is entering a new era. We aren't just baking cakes anymore; we are running a composite kitchen. We have models that can:

Look at a photo and describe it (Vision).
Listen to a voice and reply in a different voice (Speech).
Watch a video and predict what happens next (World Models).
Control a robot arm based on a command (Robotics).

These new "composite" models are messy. They don't follow a single straight line. Sometimes they need to loop back and check their work (like a diffusion model refining an image). Sometimes they need to split into three parallel paths at once (like checking a picture from three different angles). Sometimes they need to stream audio in real-time, one word at a time.

The old kitchen systems (vLLM, SGLang, and their multimodal extensions) tried to force these complex, multi-task models onto their simple, straight-line assembly lines. It was like trying to fit a juggling act onto a conveyor belt. It worked, but it was slow, clunky, and required a lot of custom "glue" code to make the pieces stick together.

The New Solution: M*
The paper introduces M*, a new serving system that acts like an operating system for these complex AI kitchens. Instead of forcing models into a straight line, M* treats each one as a dynamic map or flowchart.

Here is how M* works, using simple analogies:

1. The "Walk Graph" (The Map)

In the old days, the kitchen manager had to know the exact recipe for every dish. In M*, the model creator just draws a map (a graph) of the kitchen.

Nodes: These are the stations (e.g., "The Vision Station," "The Language Station," "The Audio Station").
Walks: These are the specific routes a customer's order takes.
- Order A (Text-to-Image): Walks from the Language Station → loops 50 times at the Art Station → ends at the Printer.
- Order B (Speech-to-Speech): Walks from the Mic Station → splits into two paths → merges back → goes to the Speaker.

M* calls this the Walk Graph. It allows the system to understand that different tasks take different paths through the same kitchen, without needing to rewrite the kitchen's layout every time.

2. The "Conductor" (The Traffic Controller)

Once the map is drawn, M* has a Conductor. This is the smart traffic controller who doesn't care what the food is, only where it needs to go.

If a request needs to loop back (like refining an image), the Conductor sends it back to the start of the loop.
If a request needs to split into three (like checking safety, style, and content), the Conductor sends three copies to three different chefs simultaneously.
If a request is streaming audio, the Conductor ensures the chef hands over the first word immediately, rather than waiting for the whole sentence to finish.

3. Flexible Seating (Placement)

In the old systems, the seating chart was mostly fixed in advance: each chef (a model component) worked at whatever table (GPU) the system's stages dictated, whether or not that suited the chef's size.
M* lets you disaggregate the kitchen. You can put the "Vision Chef" on one GPU, the "Language Chef" on another, and the "Audio Chef" on a third, and they pass their results along as the order moves through the kitchen. This means you can scale up just the part of the model that is slow, without wasting money on the parts that are fast.

What Did They Prove? (The Results)

The authors tested M* against the old systems (vLLM-Omni, SGLang-Omni, VoxServe) on five different types of complex models. Here is what happened:

Image Generation (BAGEL): Compared with vLLM-Omni, M* cut typical end-to-end latency by roughly 30% when turning text into images and by roughly 50% when editing images. It handled the complex "loops" needed to refine the image much more efficiently.
Speech (Qwen3-Omni & Orpheus): For Qwen3-Omni text-to-speech, M* needed up to 2.9 times less compute time per second of audio (a lower real-time factor) and delivered up to 2.7 times higher throughput than vLLM-Omni and SGLang-Omni. For Orpheus, it also beat VoxServe, a system built specifically for speech, by a smaller margin. It was like switching from a slow, stuttering radio to a crisp, live broadcast.
Robotics (V-JEPA 2): When rolling a world model forward for robot planning, M* was up to 12.5 times faster than the model's native implementation. That baseline had to re-calculate everything from scratch at every step; M* remembered the previous steps and just updated the new ones.

The Bottom Line

M* is a universal system that stops trying to force every AI model into a single, rigid shape. Instead, it lets the model define its own shape (a graph) and then builds a flexible, high-speed engine to run that shape.

It's the difference between a rigid assembly line that breaks when you try to build a car instead of a toaster, and a smart, adaptable workshop that can build anything from a toaster to a rocket ship with equal speed and efficiency. The paper claims this allows developers to deploy these complex, multi-modal models with minimal effort while getting significantly better performance.

Technical Summary: M* - A Modular, Extensible, Serving System for Multimodal Models

1. Problem Statement

The AI landscape is shifting toward composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion heads, audio codecs, and action generators. These models—including Unified Multimodal Models (UMMs), Omni models, Speech-Language Models (SpeechLMs), Vision-Language-Action (VLA) policies, and World Models—exhibit execution structures that vary significantly based on input modality and task. Unlike traditional text-only Large Language Models (LLMs) that follow a fixed autoregressive (AR) loop, composite models feature:

Diverse execution paths: Different tasks (e.g., image generation vs. image understanding) traverse different subsets of components within the same model.
Complex control flow: Patterns include non-autoregressive loops (e.g., diffusion steps, world model rollouts), internal parallelism (e.g., Classifier-Free Guidance branches), and streaming data dependencies.
Heterogeneous hardware requirements: Components may need to be scaled independently or placed on specific hardware topologies.

Existing serving frameworks (e.g., vLLM, SGLang) are optimized for AR text generation. While recent extensions (vLLM-Omni, SGLang-Omni) attempt to support multimodal models via fixed Directed Acyclic Graphs (DAGs) of "stages," they fail to capture complex patterns like cross-stage loops, internal parallelism, and dynamic streaming. This leads to suboptimal performance, rigid physical placement, and the need for significant custom "glue code" for new architectures.

2. Methodology: The Walk Graph Abstraction

The authors propose M*, a universal serving system built on a core abstraction called the Walk Graph. This abstraction decouples the model architecture from the system runtime.

2.1 Core Definition

A model in M* is defined as a tuple $(G, W)$:

$G$ (Computation Graph): A directed graph of heterogeneous nodes (e.g., encoders, decoders, transformer layers) and edges (tensor flows).
$W$ (Walks): A finite set of named subgraphs representing specific phases of model behavior (e.g., prefill_text, image_gen). A user request is a sequence of these walks.
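To make the $(G, W)$ definition concrete, here is a minimal Python sketch. It is not M*'s API: the class `WalkGraphModel`, its methods, and the node names (`text_embed`, `llm`, `diffusion_head`, `image_decoder`) are hypothetical, and a walk is simplified to an ordered route through the graph. Only the walk names `prefill_text` and `image_gen` come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class WalkGraphModel:
    """Toy stand-in for the (G, W) tuple; all names here are hypothetical."""
    nodes: set[str]                      # vertices of G (model components)
    edges: set[tuple[str, str]]          # directed edges of G (tensor flows)
    walks: dict[str, list[str]] = field(default_factory=dict)  # W: name -> route

    def add_walk(self, name: str, route: list[str]) -> None:
        # A walk may only visit nodes of G and follow edges that exist in G.
        unknown = set(route) - self.nodes
        if unknown:
            raise ValueError(f"walk {name!r} visits unknown nodes: {sorted(unknown)}")
        for src, dst in zip(route, route[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"walk {name!r} uses missing edge {src} -> {dst}")
        self.walks[name] = route

    def plan(self, request: list[str]) -> list[str]:
        # A request is a sequence of walk names; expand it into node visits.
        return [node for walk_name in request for node in self.walks[walk_name]]


model = WalkGraphModel(
    nodes={"text_embed", "llm", "diffusion_head", "image_decoder"},
    edges={("text_embed", "llm"), ("llm", "diffusion_head"),
           ("diffusion_head", "image_decoder")},
)
model.add_walk("prefill_text", ["text_embed", "llm"])
model.add_walk("image_gen", ["llm", "diffusion_head", "image_decoder"])

# A text-to-image request: prefill the prompt, then generate the image.
visits = model.plan(["prefill_text", "image_gen"])
```

The point of the split is that the graph is declared once, while each request type only names the walks it needs.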

2.2 Compositional Primitives

The graph is constructed using four primitives to handle complex control flow:

Sequential: Chaining subgraphs where outputs feed into the next.
Parallel: Fan-out execution of children nodes (e.g., for CFG branches).
Loop: Bounded iteration with accumulated outputs (e.g., diffusion steps).
DynamicLoop: Iteration with per-request early-exit conditions (e.g., End-of-Sequence in AR or variable horizons in world models).
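The four primitives can be illustrated with a tiny single-process interpreter. This is a sketch under simplifying assumptions, not the M* runtime: the class fields, the `run` function, and the `max_steps` safety bound on `DynamicLoop` are invented for illustration, and "parallel" children are simply evaluated one after another.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Node:
    fn: Callable[[Any], Any]             # a leaf computation

@dataclass
class Sequential:
    children: list                       # output of one child feeds the next

@dataclass
class Parallel:
    children: list                       # every child sees the same input

@dataclass
class Loop:
    body: Any
    steps: int                           # fixed, known iteration count

@dataclass
class DynamicLoop:
    body: Any
    should_stop: Callable[[Any], bool]   # per-request early-exit condition
    max_steps: int                       # hypothetical safety bound


def run(graph: Any, x: Any) -> Any:
    if isinstance(graph, Node):
        return graph.fn(x)
    if isinstance(graph, Sequential):
        for child in graph.children:
            x = run(child, x)
        return x
    if isinstance(graph, Parallel):
        return [run(child, x) for child in graph.children]
    if isinstance(graph, (Loop, DynamicLoop)):
        limit = graph.steps if isinstance(graph, Loop) else graph.max_steps
        outputs = []
        for _ in range(limit):
            x = run(graph.body, x)
            outputs.append(x)            # accumulate each iteration's output
            if isinstance(graph, DynamicLoop) and graph.should_stop(x):
                break
        return outputs
    raise TypeError(f"unknown graph element: {graph!r}")


# One CFG-like step: two branches on the same input, then a combine node.
step = Sequential([
    Parallel([Node(lambda v: v + 1), Node(lambda v: v * 2)]),
    Node(lambda pair: pair[0] + pair[1]),
])
fixed = run(Loop(step, steps=3), 1)                        # [4, 13, 40]
early = run(DynamicLoop(Node(lambda n: n + 1),
                        should_stop=lambda n: n >= 3,
                        max_steps=10), 0)                  # [1, 2, 3]
```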

2.3 Streaming and Chunking

To support real-time generation (e.g., speech), M* introduces StreamingGraphEdges parameterized by ChunkPolicies. These policies dictate when a consumer node fires based on accumulated input tensors. Supported policies include:

FixedChunkPolicy: Fires every $K$ items.
SlidingWindowChunkPolicy: Fires when a buffer holds $W$ items, advancing by $S$. (Here $W$ is a window size, unrelated to the walk set $W$ of Section 2.1.)
LeftContextChunkPolicy: Prepends context frames to chunks.
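The three policies can be sketched as plain Python generators over a stream of items. The function names and exact semantics below are assumptions made for illustration (for example, that leftover items shorter than a chunk are not emitted, that the stride is at most the window size, and that left context is taken from the most recently emitted items); the paper's actual classes may behave differently.

```python
from typing import Iterable, Iterator


def fixed_chunks(items: Iterable, k: int) -> Iterator[list]:
    """Fire every k items (trailing partial chunk is not emitted here)."""
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) == k:
            yield buf
            buf = []


def sliding_window_chunks(items: Iterable, w: int, s: int) -> Iterator[list]:
    """Fire when the buffer holds w items, then advance by s (assumes s <= w)."""
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) == w:
            yield list(buf)
            buf = buf[s:]                # keep the overlap for the next window


def left_context_chunks(items: Iterable, k: int, context: int) -> Iterator[list]:
    """Fixed chunks of k, each prefixed with up to `context` earlier items."""
    history: list = []
    for chunk in fixed_chunks(items, k):
        yield history + chunk
        history = (history + chunk)[-context:] if context > 0 else []


fixed = list(fixed_chunks(range(7), k=3))
# [[0, 1, 2], [3, 4, 5]]
windows = list(sliding_window_chunks(range(8), w=4, s=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]
with_context = list(left_context_chunks(range(6), k=2, context=1))
# [[0, 1], [1, 2, 3], [3, 4, 5]]
```

Because each generator yields as soon as its condition is met, a downstream consumer (say, an audio decoder) can start on the first chunk while the producer is still generating.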

2.4 Runtime Architecture

The M* runtime executes the Walk Graph through a distributed architecture:

Conductor: Maintains per-request state and dispatches work to workers via ZeroMQ.
Workers: One process per GPU rank, executing local subgraphs. They support continuous batching, CUDA-graph capture, and torch.compile.
Engines: Specialized inference instances (e.g., KVCacheEngine for transformers with FlashInfer paged attention, StatelessEngine for simple nodes).
Placement: Users define a mapping from GraphNodes to GPU ranks. This enables flexible strategies like prefill-decode disaggregation, encoder/decoder disaggregation, and independent scaling of components.
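As a rough illustration of the placement idea, the sketch below treats a placement as a plain dictionary from node name to GPU ranks and derives two things from it: which nodes each worker hosts, and which edges cross between workers. The helper names, the node names, and the reading of "several ranks" as replicas are all assumptions; M*'s real placement API is not shown in this summary.

```python
from collections import defaultdict


def nodes_per_rank(placement: dict[str, list[int]]) -> dict[int, list[str]]:
    """Invert node -> ranks into rank -> nodes (the worker's local subgraph)."""
    per_rank = defaultdict(list)
    for node, ranks in placement.items():
        for rank in ranks:
            per_rank[rank].append(node)
    return {rank: sorted(nodes) for rank, nodes in sorted(per_rank.items())}


def cross_rank_edges(edges: list[tuple[str, str]],
                     placement: dict[str, list[int]]) -> list[tuple[str, str]]:
    """Edges whose producer and consumer share no rank, so data must move."""
    return [(src, dst) for src, dst in edges
            if not set(placement[src]) & set(placement[dst])]


# Hypothetical deployment: prefill and decode are disaggregated, and the
# decoder is scaled out to two ranks, one of which also hosts audio decoding.
placement = {
    "vision_encoder": [0],
    "llm_prefill": [1],
    "llm_decode": [2, 3],
    "audio_decoder": [3],
}
edges = [
    ("vision_encoder", "llm_prefill"),
    ("llm_prefill", "llm_decode"),
    ("llm_decode", "audio_decoder"),
]

workers = nodes_per_rank(placement)
remote = cross_rank_edges(edges, placement)
```

In this example the `llm_decode` to `audio_decoder` edge stays local on rank 3, which is the kind of colocation that avoids inter-process transfers.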

3. Key Contributions

Walk Graph Abstraction: A unified representation that captures arbitrary composite model structures (loops, parallelism, streaming) as dataflow graphs, moving beyond the limitations of fixed-stage DAGs.
Modular Runtime: A system that compiles the graph abstraction into high-performance execution, integrating state-of-the-art optimizations (paged attention, CUDA graphs) across modalities without requiring model-specific glue code.
Flexible Physical Placement: A decoupled API allowing deployers to map components to hardware arbitrarily, enabling independent scaling and resource sharing across different request types.
Streaming Support: Native handling of producer-consumer patterns via chunk policies, essential for real-time speech and video generation.

4. Experimental Results

The authors evaluated M* on five representative models: BAGEL (UMM), Qwen3-Omni (Omni), Orpheus (SpeechLM), V-JEPA 2 (World Model), and $\pi_{0.5}$ (VLA).

BAGEL (Text-to-Image & Image Editing):
- Compared to vLLM-Omni, M* achieved ~30% lower p50 end-to-end latency for text-to-image and ~50% lower for image editing (with CFG parallelism).
- M* utilized a single configuration to optimize all modalities, whereas vLLM-Omni required trade-offs between configurations (e.g., single-stage vs. multi-stage) that degraded performance in specific tasks.
- Gains were attributed to M*'s paged KV cache management for CFG branches, avoiding the dense tensor concatenation used in vLLM-Omni.
Qwen3-Omni (Text-to-Speech):
- On the Seed-TTS benchmark, M* delivered up to 2.9× lower real-time factor (RTF) and 2.7× higher throughput compared to vLLM-Omni and SGLang-Omni.
- At batch size 16, throughput was ~15% higher than vLLM-Omni.
- Improvements stemmed from per-submodule CUDA graph capture (enabling the entire Talker loop to run as one graph) and eliminating inter-process communication for colocated components.
Orpheus (Speech):
- M* achieved 13.6% lower p50 RTF and throughput improvements ranging from 20% to 52% compared to VoxServe (a specialized speech serving system).
V-JEPA 2 (Robotic Planning/World Models):
- For action-conditioned rollouts, M* outperformed the native Meta implementation by up to 12.5× at a rollout horizon of 30.
- This was achieved by encoding the rollout as a DynamicLoop with paged-attention KV caching, avoiding the costly repeated prefills required by the baseline's hand-written loop.
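Why repeated prefills hurt can be seen with a toy count of work units. The sketch assumes that, without a cache, step $t$ of a rollout re-processes all $t$ steps of context, while with a KV cache each step processes only the newest one. The function name and the cost model are illustrative assumptions; the model ignores attention over cached entries, batching, and kernel overheads, so it is not a reconstruction of the paper's measured 12.5× figure.

```python
def steps_processed(horizon: int, use_cache: bool) -> int:
    """Count context steps pushed through the model over a whole rollout."""
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    if use_cache:
        return horizon                       # one new step per iteration
    return horizon * (horizon + 1) // 2      # 1 + 2 + ... + horizon


without_cache = steps_processed(30, use_cache=False)   # 465
with_cache = steps_processed(30, use_cache=True)       # 30
ratio = without_cache / with_cache                     # 15.5
```

The gap in this toy model grows linearly with the horizon, which is consistent with the largest reported gain appearing at the longest rollout.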

5. Significance and Claims

The paper claims that M* paves the way for the efficient serving of complex, next-generation composite models with minimal developer effort. By treating multimodal inference as graph traversal, M* allows model authors to declare architectures once while enabling deployers to optimize physical placement and execution dynamically.

The authors assert that M* achieves performance on par with or better than state-of-the-art, modality-specific baselines (like VoxServe) and general-purpose frameworks (like vLLM-Omni) across a diverse set of tasks. The system's ability to handle arbitrary loops, parallelism, and streaming without custom glue code addresses the fundamental abstraction mismatch in current serving stacks, facilitating the deployment of unified multimodal, omni, and world models in real-world applications.

M*: A Modular, Extensible, Serving System for Multimodal Models