Scheduling Parallel Optical Circuit Switches for AI Training

This paper introduces SPECTRA, a three-step scheduling algorithm that efficiently manages time-varying AI traffic across parallel optical circuit switches: it decomposes traffic matrices into permutations, assigns those permutations to switches in a load-aware order, and balances load through controlled splitting, significantly reducing makespan compared to state-of-the-art baselines.

Kevin Liang, Litao Qiao, Isaac Keslassy, Bill Lin

Published Tue, 10 Ma

Imagine you are the manager of a massive, high-speed delivery hub for a company that builds super-intelligent AI robots. This company needs to move huge amounts of "brain data" between thousands of computers (GPUs) every second to train these robots.

In the old days, this data moved through electronic switches, like a busy highway with traffic lights. But as the AI gets smarter, the traffic gets so heavy that the electronic highways are jamming, overheating, and moving too slowly.

The Solution: The Optical Express Lanes
To fix this, engineers are building Optical Circuit Switches (OCS). Think of these as magical, invisible tunnels that can instantly connect any two computers. They are incredibly fast and use very little energy.

However, there's a catch: The Tunnel Setup Time.
Imagine these optical switches are like a set of train tracks. To send a train from Station A to Station B, you have to physically move the tracks to align them. This takes a tiny bit of time (a "reconfiguration delay"). If you have to send many different packages to many different places, you can't just keep moving the tracks back and forth for every single package, or you'll spend all your time moving tracks and no time moving packages.

The New Problem: Too Many Tunnels, Too Much Traffic
To handle the massive AI traffic, data centers are now using multiple parallel optical switches (like having 4, 8, or 16 sets of these magical tunnels running side-by-side).

The big question is: How do we decide which package goes into which tunnel, and in what order, so that all the packages arrive as fast as possible?

If you send a huge package through Tunnel 1 and a tiny one through Tunnel 2, Tunnel 1 will finish late, and the whole system has to wait for it. The total time until the last tunnel finishes is called the Makespan, and the goal is to minimize it.
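In scheduling terms, the makespan is simply the finish time of the slowest switch. A toy sketch (the tunnel names and numbers are made up purely for illustration):

```python
# Toy numbers for illustration only: per-tunnel transfer times.
loads = {"tunnel_1": 90, "tunnel_2": 10}

# The whole job is only done when the slowest tunnel finishes.
makespan = max(loads.values())
print(makespan)  # 90 -- tunnel_2 sits idle for 80 time units
```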

The Paper's Hero: SPECTRA
The authors of this paper created a new algorithm called SPECTRA (Scheduling ParallEl Circuit switches for data center TRAffic). They realized that existing methods were like trying to solve a puzzle by just guessing. SPECTRA solves it in three clever steps, using a simple analogy: The Pizza Party.

Imagine you have a giant, messy pizza (the Traffic Demand) that needs to be sliced and distributed to 4 hungry friends (the Parallel Switches).

Step 1: DECOMPOSE (Cutting the Pizza)

First, you look at the messy pizza. It has toppings scattered everywhere. You need to cut it into perfect, clean slices (called Permutations) where no two slices overlap in a way that causes a traffic jam.

  • The Trick: SPECTRA uses a mathematical rule to find the minimum number of perfect slices needed to cover the whole pizza. This ensures you don't have to move the "tracks" (reconfigure the switch) too many times.
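The paper's exact decomposition isn't reproduced here, but a minimal greedy sketch of the idea — peeling permutation "slices" off a square demand matrix — might look like this (the function name and greedy column choice are my own simplifications):

```python
def greedy_decompose(demand):
    """Greedily peel permutation 'slices' off a square demand matrix.
    Each round matches every row to a distinct column (heaviest
    remaining entry first), then subtracts that permutation scaled by
    its smallest selected entry. Returns (weight, permutation) pairs."""
    D = [row[:] for row in demand]   # work on a copy
    n = len(D)
    slices = []
    while any(v > 0 for row in D for v in row):
        used, perm = set(), []
        for r in range(n):
            # heaviest still-unused column for this row
            best = max((c for c in range(n) if c not in used),
                       key=lambda c: D[r][c])
            used.add(best)
            perm.append(best)
        weight = min(D[r][perm[r]] for r in range(n))
        if weight <= 0:
            break  # greedy stalled; a real implementation would use a matching algorithm
        for r in range(n):
            D[r][perm[r]] -= weight
        slices.append((weight, tuple(perm)))
    return slices
```

For example, `greedy_decompose([[3, 1], [1, 3]])` covers the whole "pizza" with just two slices: weight 3 on the identity permutation and weight 1 on the swap.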

Step 2: SCHEDULE (Passing the Slices)

Now you have a stack of slices of different sizes. You have 4 friends (switches) waiting to eat.

  • The Trick: SPECTRA uses a "Longest First" strategy. It takes the biggest slice and gives it to the friend who is currently the least hungry (has the least load). Then it takes the next biggest slice and gives it to the next least hungry friend.
  • Why? This prevents one friend from getting stuck with three huge slices while the others are starving. It balances the load.
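This is essentially the classic longest-processing-time (LPT) heuristic from machine scheduling. A small sketch of the idea, with helper names of my own choosing:

```python
import heapq

def longest_first(weights, num_switches):
    """Give each slice, largest first, to the currently least-loaded switch."""
    heap = [(0.0, s) for s in range(num_switches)]   # (current load, switch id)
    heapq.heapify(heap)
    assignment = {s: [] for s in range(num_switches)}
    for w in sorted(weights, reverse=True):
        load, s = heapq.heappop(heap)                # the least-hungry friend
        assignment[s].append(w)
        heapq.heappush(heap, (load + w, s))
    makespan = max(load for load, _ in heap)         # slowest switch decides
    return assignment, makespan
```

For slices `[8, 7, 6, 5, 4]` on 2 switches this yields loads of 17 and 13 — not perfect (the ideal even split is 15), which is exactly why a final equalization step helps.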

Step 3: EQUALIZE (The Final Tweak)

Sometimes, even with the best planning, one friend ends up with a slice that is just a tiny bit too big, making them finish last.

  • The Trick: SPECTRA looks at the friend with the biggest slice. If that slice is huge, it cuts off a small piece of it and hands it to the friend who finished early.
  • The Catch: Moving a piece of pizza to a new friend takes a tiny bit of time (the setup delay). SPECTRA is smart enough to know: "Is it worth cutting the slice and moving it, or is the time saved worth the time spent moving it?" It only does this if it actually speeds up the whole party.
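A single equalization step can be sketched under simplifying assumptions of my own (split the gap evenly, and the receiving switch pays a fixed reconfiguration delay `delta`) — this is not the paper's exact rule:

```python
def equalize_once(loads, delta):
    """Move work from the most- to the least-loaded switch, but only if
    the makespan reduction beats the extra reconfiguration delay."""
    loads = list(loads)
    hi = max(range(len(loads)), key=loads.__getitem__)
    lo = min(range(len(loads)), key=loads.__getitem__)
    move = (loads[hi] - loads[lo]) / 2.0
    if move > delta:                # worth it: time saved exceeds setup cost
        loads[hi] -= move
        loads[lo] += move + delta   # the receiver pays the setup delay
    return loads
```

With `equalize_once([10, 2], delta=1)` the loads become `[6.0, 7.0]`, shrinking the makespan from 10 to 7; with `equalize_once([10, 9], delta=1)` nothing moves, because the split would cost more than it saves.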

The Results: Why It Matters

The authors tested SPECTRA against other methods using real-world AI training data (like the famous GPT models and a new "Mixture of Experts" model).

  • The Result: SPECTRA was 1.4x to 2.4x faster than the previous best methods.
  • The Metaphor: If the old method took 100 minutes to finish the training, SPECTRA did it in about 40 to 70 minutes.
  • The "Lower Bound": The authors also calculated the absolute theoretical fastest time possible (the "speed of light" limit for this problem). SPECTRA got incredibly close to that limit, proving it's nearly perfect.

In Summary

AI training is like a massive, high-stakes relay race. The paper introduces a new coach (SPECTRA) who organizes the runners (data packets) and the baton passes (switches) so efficiently that the team finishes the race in record time, beating the competition by a huge margin. This means AI models can be trained faster, cheaper, and with less energy.