Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

The paper introduces Mozart, an algorithm-hardware co-design framework that pairs a 3.5D wafer-scale chiplet architecture with specialized expert allocation and scheduling strategies to overcome the communication and memory bottlenecks of training large-scale Mixture-of-Experts (MoE) language models.

Shuqing Luo, Ye Han, Pingzhi Li, Jiayin Qin, Jie Peng, Yang (Katie) Zhao, Yu (Kevin) Cao, Tianlong Chen

Published Tue, 10 Ma

Here is an explanation of the Mozart paper, broken down into simple concepts and everyday analogies.

🎼 The Big Idea: Building a "Brain" for AI

Imagine you are trying to build a super-intelligent robot (a Large Language Model, or LLM) that can write stories, code, and answer questions. To make this robot smart, you need a massive brain.

In the past, we tried to build this brain as one giant, solid block of concrete (a single giant computer chip). But there are two problems:

  1. It's too heavy: You can't make the concrete block any bigger without it cracking or becoming too expensive.
  2. It's inefficient: If the robot only needs to think about "cooking," but the whole brain is active, you are wasting a lot of energy.

The Solution: The Mixture of Experts (MoE)
Instead of one giant brain, imagine a team of specialists.

  • One expert is great at math.
  • One is great at poetry.
  • One is great at coding.
  • One is great at history.

When you ask the robot a question, a "manager" (the router) looks at the question and wakes up only the specific experts needed. If you ask about math, the poet stays asleep. This is called Mixture of Experts (MoE). Because only a few experts run for each question, the model can grow much larger without the cost of answering a question growing along with it.
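To make the "manager" concrete, here is a minimal sketch of top-k routing for a single token. It is an illustration of the general MoE idea, not code from the paper: the function names, the expert labels, and the scores are all made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(scores, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights.

    `scores` is a hypothetical list of router scores, one per expert.
    Returns a list of (expert_index, weight) pairs.
    """
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Four experts; the labels are illustrative only.
experts = ["math", "poetry", "coding", "history"]
token_scores = [2.0, -1.0, 1.0, -0.5]  # made-up router scores for a math-like token
picked = route(token_scores, top_k=2)
print([(experts[i], round(w, 3)) for i, w in picked])
```

With these scores the router wakes up the math and coding experts and leaves the other two asleep, which is where the savings come from.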

🚧 The Problem: The Traffic Jam

While this "team of specialists" idea is great for software, it creates a nightmare for hardware (the physical chips).

Imagine these experts are people sitting in different rooms in a massive office building.

  • The Bottleneck: Every time a question comes in, the manager has to shout across the building to wake up the right people.
  • The Chaos: If the "Math Expert" and the "History Expert" are in rooms on opposite sides of the building, the shouting takes forever. The data has to travel through long, slow hallways (communication cables).
  • The Waste: While the experts are waiting for the data to arrive, they are just sitting there doing nothing. The building is huge, but the workers are idle.

Current hardware is like this office building: the experts do not all fit on one chip, so they end up spread across many chips, and the hallways between them are too slow.

🎻 The Solution: Enter "Mozart"

The authors of this paper created a new system called Mozart. They didn't just write better software; they redesigned the entire "office building" (the hardware) to match the way the "team of specialists" (the software) works.

They call this Algorithm-Hardware Co-Design. Think of it as designing the furniture and the floor plan at the same time, rather than buying a house and then trying to fit the furniture in.

Here is how Mozart fixes the problems:

1. The "Smart Seating Chart" (Expert Allocation)

The Problem: In a normal office, the Math Expert might be in Room 1, and the History Expert in Room 100. They rarely talk to each other, but sometimes they do.
The Mozart Fix: Mozart analyzes the data first. When it finds that two experts, say the "Math Expert" and the "Physics Expert," are frequently called on together, it seats them right next to each other in the same room (or in adjacent rooms).

  • Analogy: Instead of shouting across the office, the experts just whisper to their neighbor. This cuts down the "shouting" (communication) time drastically.
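The seating-chart idea can be sketched in a few lines. The code below counts how often pairs of experts are picked for the same token, then greedily groups experts that co-activate onto the same chiplet. This is a simple stand-in heuristic chosen for illustration, not the allocation algorithm from the paper; the trace, the capacity of two experts per chiplet, and all function names are assumptions.

```python
from collections import Counter
from itertools import combinations

def coactivation_counts(routing_trace):
    """Count how often each pair of experts is picked for the same token."""
    counts = Counter()
    for experts_for_token in routing_trace:
        for a, b in combinations(sorted(set(experts_for_token)), 2):
            counts[(a, b)] += 1
    return counts

def greedy_placement(num_experts, capacity, counts):
    """Fill chiplets one at a time, always adding the unplaced expert that
    co-activates most with the experts already on the current chiplet."""
    unplaced = set(range(num_experts))
    chiplets = []
    while unplaced:
        group = [min(unplaced)]  # seed with the lowest-numbered unplaced expert
        unplaced.remove(group[0])
        while len(group) < capacity and unplaced:
            def affinity(e):
                return sum(counts[tuple(sorted((e, g)))] for g in group)
            best = max(sorted(unplaced), key=affinity)
            group.append(best)
            unplaced.remove(best)
        chiplets.append(group)
    return chiplets

def cross_chiplet_pairs(chiplets, counts):
    """Total co-activations whose two experts sit on different chiplets."""
    home = {e: idx for idx, group in enumerate(chiplets) for e in group}
    return sum(n for (a, b), n in counts.items() if home[a] != home[b])

# Made-up trace: each entry lists the two experts one token was routed to.
trace = [(0, 3), (0, 3), (0, 3), (1, 2), (1, 2), (0, 1), (2, 3)]
counts = coactivation_counts(trace)

naive = [[0, 1], [2, 3]]  # experts seated in index order
tuned = greedy_placement(4, capacity=2, counts=counts)
print("naive:", naive, "cross-chiplet co-activations:", cross_chiplet_pairs(naive, counts))
print("tuned:", tuned, "cross-chiplet co-activations:", cross_chiplet_pairs(tuned, counts))
```

On this toy trace, seating experts in index order leaves 5 co-activations crossing between chiplets, while the co-activation-aware seating leaves 2. Fewer crossings means less "shouting across the office."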

2. The "Assembly Line" (Fine-Grained Scheduling)

The Problem: Usually, the office works in steps:

  1. Wait for all papers to arrive.
  2. Do the work.
  3. Wait for all results to be sent back.

This is slow because each step waits for the previous one to finish completely.
The Mozart Fix: Mozart uses a streaming approach.

  • Analogy: Imagine a sushi conveyor belt. As soon as the first piece of fish arrives, the chef starts cutting it. While the chef is cutting the first piece, the second piece of fish is already arriving on the belt. The chef never stops cutting, and the belt never stops moving.
  • Mozart overlaps the "waiting for data" time with the "doing the work" time.
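A small timing model shows why overlapping helps. The sketch below compares the step-by-step schedule with a three-stage pipeline (receive, compute, send) that works on the data in chunks. The chunk count and per-chunk costs are hypothetical numbers in arbitrary time units, not measurements from the paper, and the model assumes each stage handles one chunk at a time.

```python
def sequential_time(chunks, recv, compute, send):
    """Receive everything, then compute everything, then send everything."""
    return chunks * recv + chunks * compute + chunks * send

def streamed_time(chunks, recv, compute, send):
    """Three-stage pipeline: each stage starts a chunk as soon as the previous
    stage has finished that chunk and the stage itself is free."""
    recv_done = compute_done = send_done = 0.0
    for _ in range(chunks):
        recv_done = recv_done + recv
        compute_done = max(compute_done, recv_done) + compute
        send_done = max(send_done, compute_done) + send
    return send_done

# Hypothetical per-chunk costs in arbitrary time units.
chunks, recv, compute, send = 8, 2.0, 3.0, 2.0
seq = sequential_time(chunks, recv, compute, send)
pipe = streamed_time(chunks, recv, compute, send)
print("sequential:", seq, "streamed:", pipe, "ratio:", round(seq / pipe, 2))
```

In this toy setting the streamed schedule finishes in 28 units instead of 56, because after the first chunk arrives the compute stage is never idle. The ratio depends entirely on the made-up costs and says nothing about the speedups reported in the paper.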

3. The "3.5D Wafer-Scale Chiplet" (The New Building)

The Problem: Traditional chips are flat (2D). You can't put enough experts on a flat surface without them being far apart.
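The Mozart Fix: They designed a skyscraper out of small, modular blocks called "chiplets." The "3.5D" label comes from combining two layouts: chiplets placed side by side on a shared base (2.5D) and layers stacked vertically (3D).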

  • The Structure: Imagine a high-rise building where:
    • The Ground Floor is a giant switchboard (the Network-on-Package) that connects all the chiplets over short, fast links.
    • The Floors are stacked vertically (3D). The "Memory" (the library of books) is stacked directly on top of the "Workers" (the experts).
  • The Benefit: Because the library is right on top of the worker's desk, they don't have to walk to the basement to get a book. They just reach up. This makes data access incredibly fast.

🏆 The Results: Why It Matters

When the researchers tested Mozart with three popular AI models (Qwen, OLMoE, and DeepSeek), the results were impressive:

  • Speed: It was 1.9x to 2.4x faster than current methods.
  • Efficiency: It used the computer's resources much better, meaning less wasted electricity and time.

🎯 The Takeaway

Mozart is like a genius architect who realized that the old way of building computer brains was inefficient. Instead of forcing a "team of specialists" to work in a generic, sprawling office, Mozart built a custom skyscraper where:

  1. Specialists who work together sit next to each other.
  2. Work flows like a conveyor belt so no one ever waits.
  3. The building is stacked vertically to keep everything close.

This allows us to train much smarter, larger AI models much faster, paving the way for the next generation of artificial intelligence.