EROICA: Online Performance Troubleshooting for Large-scale Model Training

This paper presents EROICA, the first online troubleshooting system deployed on production-scale GPU clusters (~100,000 GPUs). It diagnoses complex hardware and software performance issues in large-scale model training through fine-grained profiling and differential observability, with minimal impact on the training itself.

Yu Guan, Zhiyu Yin, Haoyu Chen, Sheng Cheng, Chaojie Yang, Kun Qian, Tianyin Xu, Pengcheng Zhang, Yang Zhang, Hanyu Zhao, Yong Li, Wei Lin, Dennis Cai, Ennan Zhai

Published Tue, 10 Ma

Imagine you are the conductor of a massive orchestra with 100,000 musicians (GPUs) playing a complex symphony (training a giant AI model). The goal is to play the song perfectly and quickly.

But sometimes, the music slows down, gets stuck, or sounds terrible. In the past, figuring out why was like trying to find a single out-of-tune violin in a stadium full of 100,000 people while blindfolded.

EROICA is a new "super-conductor" system designed to solve this. Here is how it works, explained simply:

1. The Problem: The "Blind Spot"

Before EROICA, engineers had two bad options to find the problem:

  • Option A (The Wide-Angle Lens): They could watch the whole orchestra from a distance. They could see that the music was slow, but they couldn't tell which musician was playing wrong. It was too blurry to be useful.
  • Option B (The Microscope): They could zoom in on one musician with a super-microscope to see exactly what they were doing. But this took so much time and energy that they could only do it on one person at a time. By the time they found the problem, the orchestra had already played for hours, wasting money.

2. The Solution: EROICA's "Smart Snapshot"

EROICA is the first system that can do both at the same time: it watches the whole orchestra and sees every tiny detail, without slowing anyone down.

It works in three clever steps:

Step 1: The "Traffic Jam" Detector

EROICA constantly checks the speed of the music. If the orchestra suddenly slows down (a "performance degradation"), it instantly triggers a 20-second "super-snapshot" of every single musician simultaneously.

  • Analogy: Imagine a traffic camera that only snaps a photo of every car on a highway the exact second a traffic jam starts, rather than filming the whole day.
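
To make the trigger idea concrete, here is a minimal sketch of a slowdown detector. It is not EROICA's actual detector: the class name, the window sizes, and the 1.2x slowdown factor are all assumptions chosen for illustration. Only the 20-second snapshot length comes from the description above.

```python
from collections import deque
from statistics import median

PROFILE_SECONDS = 20  # snapshot length quoted in the text


class DegradationDetector:
    """Hypothetical trigger: fires when recent iterations run much slower than usual."""

    def __init__(self, baseline_size=50, recent_size=5, slowdown_factor=1.2):
        # All three parameters are illustrative assumptions, not values from the paper.
        self.baseline = deque(maxlen=baseline_size)  # history of healthy iteration times
        self.recent = deque(maxlen=recent_size)      # the last few iteration times
        self.slowdown_factor = slowdown_factor

    def observe(self, iteration_seconds):
        """Record one iteration time; return True if a snapshot should start."""
        self.recent.append(iteration_seconds)
        warmed_up = (len(self.baseline) == self.baseline.maxlen
                     and len(self.recent) == self.recent.maxlen)
        if warmed_up:
            slow = median(self.recent) > self.slowdown_factor * median(self.baseline)
            if slow:
                # Keep the slow sample out of the baseline so "normal" stays normal.
                return True
        self.baseline.append(iteration_seconds)
        return False
```

Using medians rather than single samples means one slow iteration does not fire the trigger; the slowdown has to persist for most of the recent window. A caller would start a `PROFILE_SECONDS`-long profiling window on every worker whenever `observe` returns True.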

Step 2: The "Summarizer" (The Magic Trick)

This is the most important part. Usually, a snapshot of 100,000 musicians would create a mountain of data (terabytes of video) that no computer could analyze quickly.

EROICA doesn't save the video. Instead, it asks every musician to write a tiny 3-line report:

  1. How much of the time were you actually playing? (Did you sit idle?)
  2. How hard were you working? (Was your heart rate high?)
  3. Was your rhythm steady or shaky? (Did you stutter?)
  • Analogy: Instead of watching a 2-hour movie of the concert, EROICA asks every musician to send a single text message saying: "I played 90% of the time, my heart rate was normal, but my rhythm was shaky."
  • Result: Instead of 3 Terabytes of data, EROICA only receives 30 Kilobytes of text. It's like shrinking a library down to a single postcard.
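
The "3-line report" can be sketched in code as follows. This is a rough illustration under stated assumptions, not the paper's exact metric definitions: the `Event` record, the field names, and the choice of coefficient of variation as the "shakiness" measure are all hypothetical. It also assumes the recorded operations do not overlap in time.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Event:
    start: float        # seconds since the snapshot began
    duration: float     # seconds the operation ran
    utilization: float  # hardware utilization in [0, 1] while it ran


def summarize(events, window_seconds=20.0):
    """Reduce one worker's raw trace to three numbers."""
    durations = [e.duration for e in events]
    busy = sum(durations)
    if busy == 0:
        # Nothing ran during the snapshot: the worker sat idle the whole time.
        return {"active_fraction": 0.0, "mean_utilization": 0.0, "jitter": 0.0}
    return {
        # 1. How much of the time were you actually playing?
        "active_fraction": busy / window_seconds,
        # 2. How hard were you working? (duration-weighted average)
        "mean_utilization": sum(e.duration * e.utilization for e in events) / busy,
        # 3. Was your rhythm steady or shaky? (0.0 means perfectly steady)
        "jitter": pstdev(durations) / mean(durations),
    }
```

The point of the design is that the report has a fixed size no matter how long the trace is: thousands of events in, three numbers out. That is what makes it cheap to ship every worker's report to one place.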

Step 3: The "Detective"

The central computer takes all those tiny text reports and compares them.

  • If 99,999 musicians say "My rhythm was steady," but one says "My rhythm was shaky," the computer instantly knows: "That one musician is the problem!"
  • It can also spot if everyone is playing slowly because the conductor (the code) is giving bad instructions.
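
The two checks above can be sketched as a comparison against the fleet median. This is one plausible way to do it, not necessarily how EROICA does: the median-absolute-deviation rule, the function names, the thresholds, and the `healthy_value` baseline are all assumptions made for illustration.

```python
from statistics import median


def find_outliers(reports, metric, threshold=5.0, min_scale=0.01):
    """Return ids of workers whose `metric` sits far from the fleet median."""
    values = [r[metric] for r in reports.values()]
    center = median(values)
    # Median absolute deviation: a spread estimate that one bad worker cannot skew.
    mad = median([abs(v - center) for v in values])
    scale = max(mad, min_scale)  # floor avoids dividing by zero when all agree
    return sorted(worker for worker, r in reports.items()
                  if abs(r[metric] - center) / scale > threshold)


def fleet_wide_problem(reports, metric, healthy_value, tolerance=0.1):
    """True if the typical worker is off, which points at shared code, not one machine."""
    center = median([r[metric] for r in reports.values()])
    return abs(center - healthy_value) > tolerance
```

For example, with 100 hypothetical workers where one reports shaky rhythm:

```python
reports = {f"gpu{i}": {"jitter": 0.02} for i in range(100)}
reports["gpu42"]["jitter"] = 0.6
print(find_outliers(reports, "jitter"))             # ['gpu42']
print(fleet_wide_problem(reports, "jitter", 0.02))  # False
```

Comparing against the median rather than the mean matters here: a single badly behaving worker barely moves the median, so it stands out instead of dragging the "normal" value toward itself.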

3. Real-World Superpowers

Because EROICA is so smart and fast, it has solved problems that previously took days or weeks to track down:

  • The Broken Wire: It found a single broken network cable connecting two computers in a cluster of thousands, which was slowing down the whole group.
  • The Lazy Worker: It spotted a specific computer that was doing extra, unnecessary work (like a musician practicing scales while everyone else was playing the song), causing the rest to wait.
  • The AI Auto-Fix: In one case, EROICA found a specific line of code causing a "deadlock" (a traffic jam in the software). It fed this information to an AI assistant, which wrote the fix right away, and the training was restarted without human help.

Why It Matters

Before EROICA, fixing these issues took days or weeks of guessing and testing. Now, EROICA can diagnose a problem in a 3,000-computer cluster in 3 minutes and a 1,000,000-computer cluster in 7 minutes.

It turns the impossible task of finding a needle in a haystack into simply asking the haystack, "Who is holding the needle?" and getting an immediate answer.