MPK: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

Mirage Persistent Kernel (MPK) is a compiler and runtime system that automatically transforms multi-GPU tensor programs into a single high-performance mega-kernel. It uses SM-level graph representations to enable cross-operator pipelining and fine-grained overlap of computation and communication, which reduces LLM inference latency compared with traditional kernel-per-operator approaches.

Published 2026-06-11

Original authors: Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao Jia

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are running a massive, high-speed factory that builds complex structures (like AI models). In the current way of doing things, every single step of the construction process is handled by a different specialized team.

The Old Way: The "Stop-and-Go" Factory
Right now, most AI systems work like a factory where Team A builds a wall, then shouts "Done!" and stops. The factory manager (the host CPU, which launches each GPU kernel one at a time) has to pause everything, check the paperwork, and then call in Team B to paint the wall. Once Team B finishes, they stop, and the manager calls Team C to install the windows.

This creates a lot of wasted time:

  1. The "Stop" Signal: Every time a team finishes, there's a mandatory pause to make sure everyone is ready for the next team. This is like a red light at every intersection.
  2. The Manager's Busy Work: The manager spends too much time running back and forth between teams, handing out new instructions, instead of letting the teams work.
  3. No Overlap: Even if Team B only needs a small part of the wall that Team A just finished, Team B has to wait for the entire wall to be built before they can start. They can't start painting just the first brick while the rest of the wall is still being built.
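To put rough numbers on the "No Overlap" point, here is a toy timing model. It is only a sketch under made-up assumptions: every tile (brick) takes one time step, painting tile i needs only built tile i, and the function names are hypothetical rather than anything from the paper.

```python
from math import ceil

def barrier_makespan(tiles_a: int, tiles_b: int, workers: int) -> int:
    """Time steps when B may start only after all of A has finished."""
    return ceil(tiles_a / workers) + ceil(tiles_b / workers)

def overlapped_makespan(tiles: int, workers: int) -> int:
    """Time steps when B tile i may start as soon as A tile i is done.

    Greedy schedule: each step runs up to `workers` ready tiles,
    preferring A tiles so that more B tiles keep becoming ready.
    """
    a_done = 0   # A tiles finished so far
    b_done = 0   # B tiles finished so far
    steps = 0
    while b_done < tiles:
        run_a = min(workers, tiles - a_done)
        # A B tile is ready only if its A tile finished in an earlier step.
        run_b = min(workers - run_a, a_done - b_done)
        a_done += run_a
        b_done += run_b
        steps += 1
    return steps

print(barrier_makespan(6, 6, 4))   # 4
print(overlapped_makespan(6, 4))   # 3
```

With 4 workers and 6 tiles per step, the barrier version leaves two workers idle at the end of each operator, while the overlapped version fills those slots with painting work.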

The New Way: MPK (Mirage Persistent Kernel)
The paper introduces MPK, which is like turning that factory into a single, continuous, super-efficient assembly line run by one giant, self-managing team.

Here is how MPK changes the game, using simple analogies:

1. The "One Big Kernel" (The Mega-Factory)

Instead of calling in different teams one by one, MPK launches one single, massive team that stays on the job from start to finish. This team doesn't stop between steps. They don't wait for a manager to tell them what to do next; they just keep moving.

  • The Benefit: It eliminates the "stop-and-go" delays. It's like a train that never stops at stations to switch engines; it just keeps rolling.
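The "stay on the job" idea can be imitated on a CPU with ordinary threads. This is an analogy only, not MPK's GPU code: the workers are started once and keep pulling tasks from a shared queue until none are left. The function name `run_persistent` is hypothetical, and the sketch ignores dependencies between tasks.

```python
import queue
import threading

def run_persistent(tasks, num_workers=4):
    """Start the workers once; each loops over a shared task queue until it is empty."""
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                fn, arg = work.get_nowait()
            except queue.Empty:
                return  # no tasks left: this worker exits
            out = fn(arg)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()   # the only "launch" in the whole run
    for th in threads:
        th.join()
    return results

tasks = [(lambda x: x * x, i) for i in range(8)]
print(sorted(run_persistent(tasks)))   # [0, 1, 4, 9, 16, 25, 36, 49]
```

Eight tasks run, but the workers are started only once, which is the property the mega-kernel design relies on.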

2. The "SM-Level Map" (The Detailed Blueprint)

The paper introduces a new way of drawing the blueprint, called a tGraph.

  • Old Blueprint: "Build Wall A, then Paint Wall A." (Too big and vague).
  • MPK Blueprint: "Worker 1 builds Brick 1, Worker 2 builds Brick 2, Worker 3 paints Brick 1 while Worker 4 builds Brick 3."
  • The Magic: MPK breaks the work down to the level of individual workers (called SMs or Streaming Multiprocessors). It knows exactly which worker needs which piece of data and when. This allows different workers to do different jobs (like building and painting) at the exact same time, as long as they don't step on each other's toes.
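The brick-level blueprint can be pictured as a small dependency graph. The sketch below is illustrative and is not the paper's tGraph format: the `Task` class and both helper functions are hypothetical names, and each "paint" task is assumed to depend only on the matching "build" task.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    deps: set = field(default_factory=set)   # names of tasks that must finish first

def build_tile_graph(num_tiles: int) -> dict:
    """Split 'build' and 'paint' into per-tile tasks; paint[i] depends only on build[i]."""
    graph = {}
    for i in range(num_tiles):
        graph[f"build[{i}]"] = Task(f"build[{i}]")
        graph[f"paint[{i}]"] = Task(f"paint[{i}]", deps={f"build[{i}]"})
    return graph

def ready_tasks(graph: dict, done: set) -> list:
    """Tasks not yet done whose dependencies have all finished."""
    return sorted(n for n, t in graph.items() if n not in done and t.deps <= done)

g = build_tile_graph(3)
print(ready_tasks(g, done=set()))
# ['build[0]', 'build[1]', 'build[2]']
print(ready_tasks(g, done={"build[0]"}))
# ['build[1]', 'build[2]', 'paint[0]']
```

As soon as the first brick is built, painting it becomes available alongside the remaining building work, with no need to wait for the whole wall.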

3. The "Self-Driving" Runtime (The Smart Foreman)

Inside this giant team, there is a smart, self-managing system (the In-Kernel Runtime).

  • No Middleman: Instead of a manager outside the factory shouting instructions, the workers talk directly to each other.
  • Just-in-Time vs. Ready-to-Go:
    • If a task is tricky and depends on unpredictable things (like how long a sentence is in a chat), the system waits until the previous step is actually done before starting the next one (Just-in-Time).
    • If a task is predictable, the system lines it up and gets it ready before the previous step is even finished (Ahead-of-Time).
  • The Result: The workers never sit idle waiting for instructions. They are always busy.
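The difference between the two dispatch modes can be sketched in a few lines. This is a simplified model, not the paper's in-kernel runtime: the `Worker` class, the task dictionaries, and the "pick the shortest queue" rule for just-in-time tasks are all assumptions made for illustration.

```python
from collections import deque

class Worker:
    def __init__(self, name):
        self.name = name
        self.queue = deque()   # tasks already assigned to this worker

def dispatch_aot(task, worker):
    """Ahead-of-time: enqueue now; the worker checks the dependency itself later."""
    worker.queue.append(task)

def dispatch_jit(task, workers, finished):
    """Just-in-time: assign only once the dependency is done (assumed: shortest queue wins)."""
    if task["needs"] not in finished:
        return None   # not ready yet: keep holding the task
    target = min(workers, key=lambda w: len(w.queue))
    target.queue.append(task)
    return target.name

workers = [Worker("w0"), Worker("w1")]
finished = set()

# Predictable task: lined up on w0 before "norm" has finished.
dispatch_aot({"name": "matmul", "needs": "norm"}, workers[0])

# Unpredictable task: held back until "qkv" is actually done.
attn = {"name": "attention", "needs": "qkv"}
print(dispatch_jit(attn, workers, finished))   # None
finished.add("qkv")
print(dispatch_jit(attn, workers, finished))   # w1
```

The trade-off this models: ahead-of-time placement avoids any assignment delay, while just-in-time placement can react to how busy each worker turns out to be.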

4. The "Paged Shared Memory" (The Shared Toolbox)

In the old factory, the toolbox was handed over whole. The next team could not touch a single tool until the current team had finished and put everything back.

  • MPK's Solution: They use a "Paged Shared Memory." Imagine a giant, shared toolbox where tools are organized into small, movable pages. As soon as a worker finishes with a tool, they put it back on the shelf, and the next worker can immediately grab it. This allows workers to grab materials for their next job while they are still finishing their current job.
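A page-based toolbox is easy to model. The class below is a minimal sketch, assuming fixed-size pages that tasks acquire and release one at a time; `PagePool` and its methods are hypothetical names, not MPK's actual interface.

```python
class PagePool:
    """Fixed-size pages of a shared scratchpad, handed out and returned one by one."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))
        self.owner = {}   # page id -> name of the task holding it

    def acquire(self, task: str, count: int):
        """Give `count` pages to `task`, or None if not enough are free right now."""
        if count > len(self.free):
            return None
        pages = [self.free.pop() for _ in range(count)]
        for p in pages:
            self.owner[p] = task
        return pages

    def release(self, pages):
        """Return pages so another task can claim them immediately."""
        for p in pages:
            del self.owner[p]
            self.free.append(p)

pool = PagePool(num_pages=4)
current = pool.acquire("task_A", 3)
print(pool.acquire("task_B", 2))   # None: only one page is free
pool.release(current[:1])          # task_A finishes with one page early
nxt = pool.acquire("task_B", 2)    # task_B can now start loading its inputs
print(len(nxt), len(pool.free))    # 2 0
```

Because pages are returned individually, the next task can begin fetching its inputs while the current task is still using the rest of the memory.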

Why Does This Matter?

The paper tested this system on Large Language Models (the brains behind AI chatbots) using powerful NVIDIA GPUs.

  • Speed: MPK ran inference up to 1.7 times faster than the existing systems it was compared against.
  • Latency: It reduced the time it takes to generate a single word of text, bringing it very close to the physical limits of the hardware.
  • Ease of Use: The best part? You don't need to be a factory expert to use it. You can take a standard AI model (written in PyTorch) and "MPK-ify" it with just a few lines of code. The system automatically figures out how to break the work down and manage the workers.

In Summary:
MPK takes a chaotic factory where teams stop and start constantly, and turns it into a smooth, continuous, self-managing assembly line. By breaking work down into small pieces and letting the workers coordinate directly, it removes much of the wasted time, making AI inference significantly faster and more efficient.
