Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

Here is an explanation of the Mozart paper, broken down into simple concepts and everyday analogies.

🎼 The Big Idea: Building a "Brain" for AI

Imagine you are trying to build a super-intelligent robot (a Large Language Model, or LLM) that can write stories, code, and answer questions. To make this robot smart, you need a massive brain.

In the past, we tried to build this brain as one giant, solid block of concrete (a single giant computer chip). But there are two problems:

It's too heavy: You can't make the concrete block any bigger without it cracking or becoming too expensive.
It's inefficient: If the robot only needs to think about "cooking," but the whole brain is active, you are wasting a lot of energy.

The Solution: The Mixture of Experts (MoE)
Instead of one giant brain, imagine a team of specialists.

One expert is great at math.
One is great at poetry.
One is great at coding.
One is great at history.

When you ask the robot a question, a "manager" (the router) looks at the question and wakes up only the specific experts needed. If you ask about math, the poet stays asleep. This is called Mixture of Experts (MoE). Because only a few experts run for each question, it is much faster and cheaper than waking up the whole brain every time.
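For readers who like to see the idea in code, the manager's decision can be sketched in a few lines. This is a generic "top-k" router of the kind commonly used in MoE models, not code from the paper; the expert names, the scores, and the choice of k=2 are all made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}

# Hypothetical experts and router scores for one math-flavored question.
EXPERTS = ["math", "poetry", "coding", "history"]
chosen = route([2.0, -1.0, 1.0, 0.0], k=2)
active = sorted(EXPERTS[i] for i in chosen)
print(active)  # the poet and the historian stay asleep
```

Only the experts in `chosen` do any work; their answers are blended using the weights the router assigned.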

🚧 The Problem: The Traffic Jam

While this "team of specialists" idea is great for software, it creates a nightmare for hardware (the physical chips).

Imagine these experts are people sitting in different rooms in a massive office building.

The Bottleneck: Every time a question comes in, the manager has to shout across the building to wake up the right people.
The Chaos: If the "Math Expert" and the "History Expert" are in rooms on opposite sides of the building, the shouting takes forever. The data has to travel through long, slow hallways (communication cables).
The Waste: While the experts are waiting for the data to arrive, they are just sitting there doing nothing. The building is huge, but the workers are idle.

Current computer systems are like this office building: the experts do not all fit close together, so they end up spread far apart, and the hallways between them are too slow.

🎻 The Solution: Enter "Mozart"

The authors of this paper created a new system called Mozart. They didn't just write better software; they redesigned the entire "office building" (the hardware) to match the way the "team of specialists" (the software) works.

They call this Algorithm-Hardware Co-Design. Think of it as designing the furniture and the floor plan at the same time, rather than buying a house and then trying to fit the furniture in.

Here is how Mozart fixes the problems:

1. The "Smart Seating Chart" (Expert Allocation)

The Problem: In a normal office, seats are assigned with no regard for who works with whom. The Math Expert might be in Room 1 and the Physics Expert in Room 100, even though they are called on together all the time.
The Mozart Fix: Mozart analyzes the data first. It learns that the "Math Expert" and the "Physics Expert" are frequently needed together. So, it seats them right next to each other in the same room (or in adjacent rooms).

Analogy: Instead of shouting across the office, the experts just whisper to their neighbor. This cuts down the "shouting" (communication) time drastically.

2. The "Assembly Line" (Fine-Grained Scheduling)

The Problem: Usually, the office works in steps:

Step 1: Wait for all papers to arrive.
Step 2: Do the work.
Step 3: Wait for all results to be sent back.
This is slow because each step waits for the previous one to finish completely.
The Mozart Fix: Mozart uses a streaming approach.

Analogy: Imagine a sushi conveyor belt. As soon as the first piece of fish arrives, the chef starts cutting it. While the chef is cutting the first piece, the second piece of fish is already arriving on the belt. The chef never stops cutting, and the belt never stops moving.
Mozart overlaps the "waiting for data" time with the "doing the work" time.

3. The "3.5D Wafer-Scale Chiplet" (The New Building)

The Problem: Traditional chips are flat (2D). You can't put enough experts on a flat surface without them being far apart.
The Mozart Fix: They designed a "3.5D" skyscraper made of small, modular blocks called "chiplets": blocks are stacked vertically (3D) and also linked side by side on a shared base (2.5D).

The Structure: Imagine a high-rise building where:
- The Ground Floor is a giant switchboard (the Network-on-Package) that connects all the rooms.
- The Floors are stacked vertically (3D). Fast "Memory" (a shelf holding the most-used books) is stacked directly with the "Workers" (the experts).
The Benefit: Because the shelf is part of the worker's desk, they don't have to walk to the basement every time they need a book. This makes access to frequently used data much faster.

🏆 The Results: Why It Matters

When the researchers tested Mozart with three popular AI models (Qwen, OLMoE, and DeepSeek), the results were impressive:

Speed: It was 1.9x to 2.4x faster than the baseline approach (standard expert parallelism).
Efficiency: It used the computer's resources much better, meaning less wasted electricity and time.

🎯 The Takeaway

Mozart is like a genius architect who realized that the old way of building computer brains was inefficient. Instead of forcing a "team of specialists" to work in a generic, sprawling office, Mozart built a custom skyscraper where:

Specialists who work together sit next to each other.
Work flows like a conveyor belt so waiting is kept to a minimum.
The building is stacked vertically to keep everything close.

This allows us to train much smarter, larger AI models much faster, paving the way for the next generation of artificial intelligence.

Here is a detailed technical summary of the paper "Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures."

1. Problem Statement

Large Language Models (LLMs) utilizing Mixture-of-Experts (MoE) architectures offer superior efficiency by activating only a subset of parameters per token. However, training these models on conventional hardware (e.g., GPUs) faces significant bottlenecks:

Memory Locality & Communication Overhead: The dynamic and sparse nature of MoE routing leads to inefficient memory access patterns and massive All-to-All communication overhead between parallel units.
Resource Utilization: Traditional hardware often assumes dense, uniform computation, leading to poor utilization when handling the heterogeneous and dynamic workloads of MoE.
Scalability Limits: Monolithic chip scaling is limited by the lithographic reticle size and by slowing transistor scaling, while existing 2.5D/3D chiplet solutions often rely on coarse-grained, static workload partitioning that fails to address fine-grained MoE modularity.

2. Methodology: The Mozart Framework

Mozart is an algorithm-hardware co-design framework specifically tailored for training MoE-based LLMs on 3.5D wafer-scale chiplet architectures. It addresses the bottlenecks through three primary pillars:

A. Algorithm-Side Optimizations

Expert Clustering & Allocation (Specialized Layout):
- Profiling: The system analyzes the pre-training dataset to identify expert specialization (which experts are frequently activated) and expert collaboration (which pairs of experts are co-activated).
- Clustering: Using a farthest-point sampling-inspired algorithm, experts are clustered to maximize intra-cluster co-activation and minimize inter-cluster communication.
- Allocation: Clusters are mapped to chiplets via a binary integer programming approach to balance workloads across groups of chiplets sharing a DRAM interface. This reduces the volume of All-to-All communication by ensuring co-activated experts reside on the same or adjacent chiplets.
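The profiling and clustering steps above can be illustrated with a small sketch. It counts pairwise co-activations from a routing trace, groups experts greedily, and then counts how many tokens still need experts from more than one cluster. The greedy grouping is a simplified stand-in chosen for brevity, not the paper's farthest-point-sampling-inspired algorithm or its binary integer program, and the trace, function names, and cluster size are hypothetical.

```python
from collections import Counter
from itertools import combinations

def coactivation_counts(routing_trace):
    """Count how often each pair of experts is activated by the same token."""
    counts = Counter()
    for experts in routing_trace:
        for a, b in combinations(sorted(set(experts)), 2):
            counts[(a, b)] += 1
    return counts

def pair_count(counts, a, b):
    """Look up a pair regardless of argument order (missing pairs count as 0)."""
    return counts[(min(a, b), max(a, b))]

def greedy_clusters(num_experts, counts, cluster_size):
    """Assumed greedy heuristic: seed with the lowest-numbered free expert, then
    repeatedly add the free expert most co-activated with the current members."""
    unassigned = set(range(num_experts))
    clusters = []
    while unassigned:
        cluster = [min(unassigned)]
        unassigned.remove(cluster[0])
        while len(cluster) < cluster_size and unassigned:
            best = max(sorted(unassigned),
                       key=lambda e: sum(pair_count(counts, e, m) for m in cluster))
            cluster.append(best)
            unassigned.remove(best)
        clusters.append(sorted(cluster))
    return clusters

def cross_cluster_tokens(routing_trace, clusters):
    """Number of tokens whose experts span more than one cluster."""
    home = {e: i for i, c in enumerate(clusters) for e in c}
    return sum(1 for experts in routing_trace
               if len({home[e] for e in experts}) > 1)

# Hypothetical trace: each entry lists the experts one token was routed to.
trace = [(0, 2), (0, 2), (1, 3), (0, 2), (1, 3), (0, 1)]
counts = coactivation_counts(trace)
naive = [[0, 1], [2, 3]]  # experts grouped by index, ignoring co-activation
learned = greedy_clusters(4, counts, cluster_size=2)
naive_cross = cross_cluster_tokens(trace, naive)
learned_cross = cross_cluster_tokens(trace, learned)
print(learned, naive_cross, learned_cross)
```

In this toy trace, grouping by co-activation leaves one token that must reach across clusters, versus five under the index-based grouping. A real allocation step would additionally balance load across chiplet groups, which this sketch does not attempt.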
Fine-Grained Scheduling (Streaming):
- Streaming Tokens: The global token batch is partitioned into micro-batches to overlap DRAM communication (loading activations/weights) with on-chip computation.
- Streaming Experts: Expert clusters are loaded based on workload priority. Highly activated clusters are loaded first to maximize overlap between communication and computation, mitigating the latency of off-chip memory access.
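A back-of-the-envelope latency model may help show why both forms of streaming pay off. The sketch below assumes a single DRAM channel and a single compute unit, each handling one item at a time, with made-up costs in arbitrary time units. It is a simplification for intuition, not the paper's simulator.

```python
def serial_latency(items):
    """Load everything first, then compute everything: no overlap."""
    return sum(load for load, _ in items) + sum(compute for _, compute in items)

def streamed_latency(items):
    """Two-stage pipeline: item i+1 is loaded while item i is being computed."""
    load_done = 0.0
    compute_done = 0.0
    for load, compute in items:
        load_done += load                                       # loads run back to back
        compute_done = max(compute_done, load_done) + compute   # compute waits for its data
    return compute_done

# Streaming tokens: four hypothetical micro-batches as (load, compute) pairs.
micro_batches = [(4.0, 3.0)] * 4
serial = serial_latency(micro_batches)
streamed = streamed_latency(micro_batches)

# Streaming experts: three hypothetical clusters with equal load cost but
# different amounts of work (compute time assumed proportional to tokens routed).
arrival_order = [(4.0, 1.0), (4.0, 2.0), (4.0, 6.0)]
priority_order = sorted(arrival_order, key=lambda item: item[1], reverse=True)
arrival = streamed_latency(arrival_order)
prioritized = streamed_latency(priority_order)

print(serial, streamed, arrival, prioritized)
```

Under these assumptions, loading the busiest cluster first gives its long computation the most time to hide the remaining loads, which is why the prioritized order finishes earlier than the arrival order.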

B. Hardware Architecture (3.5D Wafer-Scale)

Mozart proposes a novel physical architecture to support the algorithmic optimizations:

3.5D Integration: Combines vertical 3D stacking (Logic-on-Memory via hybrid bonding) with 2.5D interconnects.
- Logic-on-Memory: Compute dies are vertically stacked with SRAM dies, providing ultra-low latency access to frequently reused data (activations).
- Hierarchical Memory: Model weights reside in off-chip DRAM, while activations are cached in local SRAM.
2.5D NoP-Tree Topology:
- Central Attention Node: A dedicated chiplet handles memory-bound attention operations near the DRAM.
- Leaf Expert Nodes: Compute-bound expert clusters are organized as leaves.
- In-Network Aggregation: Switch modules possess compute capabilities to perform local aggregation of MoE outputs, reducing the need for data to traverse the entire network.
Disaggregated Design: Separates memory-bound (Attention) and compute-bound (Expert) tasks onto specialized chiplets with matching bandwidth and compute capabilities.
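To make in-network aggregation concrete, the following sketch models a tiny two-level tree: expert leaves produce partial outputs for one token, and switches either forward every vector unchanged or sum them locally before forwarding. The tree shape, node names, and vectors are hypothetical, and treating aggregation as a plain element-wise sum of already gate-weighted outputs is an assumption about what "local aggregation" means here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    partial: Optional[List[float]] = None  # leaves only: gate-weighted expert output

def vec_sum(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]

def reduce_up(node, aggregate):
    """Return (vectors this node forwards upward, vectors sent over links below it)."""
    if not node.children:
        return ([node.partial] if node.partial is not None else []), 0
    incoming, traffic = [], 0
    for child in node.children:
        vectors, below = reduce_up(child, aggregate)
        incoming += vectors
        traffic += below + len(vectors)  # every forwarded vector crosses one link
    if aggregate and len(incoming) > 1:
        incoming = [vec_sum(incoming)]   # switch adds partial outputs locally
    return incoming, traffic

tree = Node("attention", [
    Node("switch0", [Node("e0", partial=[1.0, 0.0]), Node("e1", partial=[0.0, 1.0])]),
    Node("switch1", [Node("e2", partial=[2.0, 2.0]), Node("e3", partial=[0.5, 0.5])]),
])

plain_vectors, plain_traffic = reduce_up(tree, aggregate=False)
agg_vectors, agg_traffic = reduce_up(tree, aggregate=True)
print(vec_sum(plain_vectors), plain_traffic)
print(agg_vectors[0], agg_traffic)
```

Both runs produce the same combined output, but with aggregation each switch sends a single vector toward the root instead of one per active expert, so fewer vectors cross the upper links of the tree.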

3. Key Contributions

Co-Design Framework: Introduced Mozart, a unified framework that jointly optimizes expert placement, scheduling, and hardware topology for MoE training.
Communication-Efficient Layout: Developed a strategy to place co-activated experts on the same chiplet, significantly reducing the All-to-All communication complexity ($C_T$).
Novel 3.5D Architecture: Designed a wafer-scale system featuring a 2.5D NoP-Tree interconnect and 3D logic-on-memory stacks, specifically optimized for the sparse, modular nature of MoE.
Fine-Grained Pipeline: Implemented a streaming scheduling mechanism that effectively overlaps DRAM communication with on-chip computation, addressing the memory wall.

4. Experimental Results

The authors evaluated Mozart across three popular open-source MoE models: Qwen3-30B-A3B, OLMoE-1B-7B, and DeepSeek-MoE-16B.

Performance Gains: Mozart achieved 1.92× to 2.37× speedup in end-to-end post-training latency compared to baseline methods (standard expert parallelism on similar hardware assumptions).
- Qwen3-30B-A3B: 1.92× speedup.
- OLMoE-1B-7B: 2.37× speedup.
- DeepSeek-MoE-16B: 2.17× speedup.
Communication Reduction: The specialized expert layout reduced the All-to-All communication data volume ($C_T$) substantially (e.g., from 8.0 to 4.32 for DeepSeek-MoE, a reduction of about 46%), which translates into lower latency.
Robustness: The system showed reduced sensitivity to increasing sequence lengths and remained efficient across off-chip memory options with different bandwidths (HBM2 vs. SSD), though HBM2 yielded the best results due to better overlap of communication with computation.
Bottleneck Analysis: The system is primarily memory-bound (constrained by sequential weight loading), but Mozart's optimizations successfully eliminate the computational bottlenecks associated with heavy MoE operations.
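The reported figures can be put on a common footing with a little arithmetic. The snippet below only re-expresses numbers quoted in this summary (the relative drop in $C_T$ and the share of baseline latency that each speedup removes). The helper names are illustrative, and no new measurements are implied.

```python
# Figures quoted in the summary above.
speedups = {"Qwen3-30B-A3B": 1.92, "OLMoE-1B-7B": 2.37, "DeepSeek-MoE-16B": 2.17}
c_t_before, c_t_after = 8.0, 4.32  # DeepSeek-MoE, units as reported

def relative_reduction(before, after):
    """Fraction by which a quantity shrank."""
    return (before - after) / before

def latency_saved(speedup):
    """Fraction of baseline latency removed by a given speedup."""
    return 1.0 - 1.0 / speedup

c_t_drop = relative_reduction(c_t_before, c_t_after)
print(f"C_T reduction: {c_t_drop:.0%}")
for model, s in speedups.items():
    print(f"{model}: {s:.2f}x speedup removes {latency_saved(s):.0%} of baseline latency")
```

Note that a speedup of s removes a fraction 1 - 1/s of the baseline latency, so a 2× speedup corresponds to halving it.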

5. Significance

Scalability: Mozart demonstrates a viable path for scaling MoE-LLMs beyond the limits of monolithic GPUs by leveraging wafer-scale integration.
Efficiency: By aligning the logical modularity of MoE with the physical modularity of chiplets, it drastically reduces communication overhead and improves resource utilization.
Future-Proofing: The proposed 3.5D architecture and co-design principles offer a blueprint for future hardware accelerators tailored specifically for sparse, dynamic workloads like Mixture-of-Experts, moving beyond the "one-size-fits-all" GPU paradigm.

In conclusion, Mozart makes the case that algorithm-hardware co-design is essential for the next generation of LLM training: specialized expert placement, streaming scheduling, and a novel 3.5D wafer-scale architecture together deliver 1.92× to 2.37× speedups over the baseline.