M*: A Modular, Extensible, Serving System for Multimodal Models

This paper introduces M*, a modular and extensible serving system that represents composite multimodal models as dataflow graphs to enable flexible component composition and distributed optimization, achieving significant latency and throughput improvements over existing frameworks across diverse tasks like text-to-image, text-to-speech, and robotic planning.

Original authors: Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

Published 2026-06-12

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are running a massive, high-tech kitchen.

The Old Way (Existing Systems):
For a long time, AI models were like simple, single-lane assembly lines. If you wanted to bake a cake (generate text), every ingredient went down the same path: mix, bake, frost, serve. Systems like vLLM were built specifically for this "text-only" assembly line. They were incredibly fast at baking cakes.

But now, AI is entering a new era. We aren't just baking cakes anymore; we are running a composite kitchen. We have models that can:

  • Look at a photo and describe it (Vision).
  • Listen to a voice and reply in a different voice (Speech).
  • Watch a video and predict what happens next (World Models).
  • Control a robot arm based on a command (Robotics).

These new "composite" models are messy. They don't follow a single straight line. Sometimes they need to loop back and check their work (like a diffusion model refining an image). Sometimes they need to split into three parallel paths at once (like checking a picture from three different angles). Sometimes they need to stream audio in real-time, one word at a time.

The old kitchen systems (vLLM, SGLang) tried to force these complex, multi-task models onto their simple, straight-line assembly lines. It was like trying to fit a juggling act onto a conveyor belt. It worked, but it was slow, clunky, and required a lot of custom "glue" code to make the pieces stick together.

The New Solution: M*
The paper introduces M*, a new serving system for these complex AI kitchens. Instead of forcing models into a straight line, M* treats them like a dynamic map or a flowchart.

Here is how M* works, using simple analogies:

1. The "Walk Graph" (The Map)

In the old days, the kitchen manager had to know the exact recipe for every dish. In M*, the model creator just draws a map (a graph) of the kitchen.

  • Nodes: These are the stations (e.g., "The Vision Station," "The Language Station," "The Audio Station").
  • Walks: These are the specific routes a customer's order takes.
    • Order A (Text-to-Image): Walks from the Language Station → loops 50 times at the Art Station → ends at the Printer.
    • Order B (Speech-to-Speech): Walks from the Mic Station → splits into two paths → merges back → goes to the Speaker.

M* calls this the Walk Graph. It allows the system to understand that different tasks take different paths through the same kitchen, without needing to rewrite the kitchen's layout every time.
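To make the map idea concrete, here is a minimal Python sketch of nodes and walks. It is an illustration only, not M*'s actual API: the `WalkGraph` class, its methods, and the station names are all hypothetical, and the parallel split in Order B is left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class WalkGraph:
    """Hypothetical container: stations (nodes) plus named routes (walks)."""
    nodes: set = field(default_factory=set)
    walks: dict = field(default_factory=dict)

    def add_walk(self, name, steps):
        # Each step is (node_name, repeat_count); repeat_count > 1 models a loop.
        for node, _ in steps:
            self.nodes.add(node)
        self.walks[name] = list(steps)

    def expand(self, name):
        # Unroll loops into the flat sequence of node visits for one request.
        visits = []
        for node, repeat in self.walks[name]:
            visits.extend([node] * repeat)
        return visits


graph = WalkGraph()
# Order A from the text: language station, 50 passes at the art station, printer.
graph.add_walk("text_to_image", [("language", 1), ("art", 50), ("printer", 1)])
# A second task reuses the same language station without changing the layout.
graph.add_walk("text_to_text", [("language", 1)])

visits = graph.expand("text_to_image")
print(len(visits), visits[0], visits[-1])  # 52 language printer
```

The point of the sketch is that both walks share one set of stations: adding a task means adding a route, not rebuilding the kitchen.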

2. The "Conductor" (The Traffic Controller)

Once the map is drawn, M* has a Conductor. This is the smart traffic controller who doesn't care what the food is, only where it needs to go.

  • If a request needs to loop back (like refining an image), the Conductor sends it back to the start of the loop.
  • If a request needs to split into three (like checking safety, style, and content), the Conductor sends three copies to three different chefs simultaneously.
  • If a request is streaming audio, the Conductor ensures the chef hands over the first word immediately, rather than waiting for the whole sentence to finish.
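The three routing behaviors above (loop, split, stream) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Conductor's real implementation: the function names `run_loop`, `fan_out`, and `stream` are made up, and the "chefs" are ordinary built-in functions standing in for model components.

```python
from concurrent.futures import ThreadPoolExecutor


def run_loop(step, state, iterations):
    # Loop: feed the output of each pass back in as the next input.
    for _ in range(iterations):
        state = step(state)
    return state


def fan_out(workers, item):
    # Split: send the same item to every worker concurrently, then gather
    # the results in the same order as the workers list.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(worker, item) for worker in workers]
        return [future.result() for future in futures]


def stream(chunks, stage):
    # Stream: hand over each processed chunk as soon as it is ready,
    # instead of waiting for the whole input to finish.
    for chunk in chunks:
        yield stage(chunk)


refined = run_loop(lambda x: x + 1, 0, 50)
checks = fan_out([str.upper, str.lower, len], "Cake")
first = next(stream(iter(["hel", "lo"]), str.upper))
print(refined, checks, first)  # 50 ['CAKE', 'cake', 4] HEL
```

Note that none of the three helpers inspects what it is routing, which mirrors the idea that the traffic controller only cares where a request goes next.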

3. Flexible Seating (Placement)

In the old systems, if you had a big chef (a large AI model), you had to put them on one giant table (one GPU). If you had a small helper (a small encoder), they had to sit at a tiny table.
M* lets you disaggregate the kitchen. You can put the "Vision Chef" on one computer, the "Language Chef" on another, and the "Audio Chef" on a third, and they pass intermediate results to one another as a request moves along its walk. This means you can scale up just the part of the model that is slow, without wasting money on the parts that are fast.
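A rough way to picture disaggregated placement is a table from each station to the devices that host it. The sketch below is hypothetical: the device labels, `scale_out`, and `pick_device` are invented for illustration, and the round-robin rule is just one simple choice, not something the paper is known to use.

```python
# Hypothetical placement: each station lives on its own device.
placement = {
    "vision": ["gpu:0"],
    "language": ["gpu:1"],
    "audio": ["gpu:2"],
}


def scale_out(placement, node, new_devices):
    # Return a new placement with extra replicas for one node only;
    # every other node keeps exactly the devices it had.
    updated = {name: list(devices) for name, devices in placement.items()}
    updated[node].extend(new_devices)
    return updated


def pick_device(placement, node, request_id):
    # Round-robin over a node's replicas, keyed by request id (an assumption).
    devices = placement[node]
    return devices[request_id % len(devices)]


# Suppose the language station is the bottleneck: add two replicas for it alone.
scaled = scale_out(placement, "language", ["gpu:3", "gpu:4"])
print(scaled["language"])                  # ['gpu:1', 'gpu:3', 'gpu:4']
print(pick_device(scaled, "language", 4))  # gpu:3
print(scaled["vision"])                    # ['gpu:0']
```

Only the slow station gains hardware here; the vision and audio entries are untouched, which is the cost argument the section makes.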

What Did They Prove? (The Results)

The authors tested M* against the old systems (vLLM-Omni, SGLang-Omni, VoxServe) on five different types of complex models. Here are the headline results:

  • Image Generation (BAGEL): When turning text into images, M* was 20% to 50% faster than the old systems. It handled the complex "loops" needed to refine the image much more efficiently.
  • Speech (Qwen3-Omni & Orpheus): When generating speech, M* was up to 2.9 times faster in real-time performance and handled 2.7 times more requests at once. It was like switching from a slow, stuttering radio to a crisp, live broadcast.
  • Robotics (V-JEPA 2): When predicting future video frames for robot planning, M* was up to 12.5 times faster. The old system had to re-calculate everything from scratch every step; M* remembered the previous steps and just updated the new ones.
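The robotics result rests on a familiar idea: keep what you already computed instead of redoing it at every step. The toy counter below shows why that matters. It is not V-JEPA 2 or M* code, the function names are hypothetical, and the counts it prints are not meant to reproduce the 12.5x figure; it only assumes that step t needs features for all t frames seen so far.

```python
def encodes_without_cache(num_steps):
    # From scratch: step t re-encodes all t frames seen so far.
    return sum(t for t in range(1, num_steps + 1))


def encodes_with_cache(num_steps):
    # With a cache: each frame is encoded once, then reused on later steps.
    cache = {}
    calls = 0
    for t in range(1, num_steps + 1):
        for frame in range(t):
            if frame not in cache:
                cache[frame] = f"features({frame})"
                calls += 1
    return calls


print(encodes_without_cache(10), encodes_with_cache(10))  # 55 10
```

Without the cache the work grows roughly with the square of the number of steps, while with it the work grows linearly, so the gap widens as the planning horizon gets longer.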

The Bottom Line

M* is a universal system that stops trying to force every AI model into a single, rigid shape. Instead, it lets the model define its own shape (a graph) and then builds a flexible, high-speed engine to run that shape.

It's the difference between a rigid assembly line that breaks when you try to build a car instead of a toaster, and a smart, adaptable workshop that can build anything from a toaster to a rocket ship with equal speed and efficiency. The paper claims this allows developers to deploy these complex, multi-modal models with minimal effort while getting significantly better performance.
