MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

MegaScale-Data is an industrial-grade distributed data-loading architecture for multisource large foundation model training. It tackles workload imbalance and memory redundancy through disaggregated preprocessing, centralized orchestration, and auto-partitioning, achieving up to a 4.5x throughput improvement and a 13.5x reduction in CPU memory usage.

Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, Junda Feng, Jianyu Jiang, Yangrui Chen, Shuaishuai Cao, Jialing He, Kaihua Jiang, Yuanzhe Hu, Shibiao Nong, Yanghua Peng, Haibin Lin, Chuan Wu

Published 2026-03-17

Imagine you are running a massive, high-stakes cooking competition to create the world's most delicious dish (the Large Foundation Model). You have thousands of chefs (GPUs) working in a giant kitchen, and they need a constant stream of ingredients (data) to keep cooking.

The problem? Your ingredients come from hundreds of different farms, fisheries, and spice markets (Multisource Data). Some are fresh vegetables (text), some are frozen fish (images), and some are exotic spices (video).

In the old way of doing things (the Old Dataloader), every single chef had their own personal assistant running around the market, buying ingredients, washing them, chopping them, and bringing them back.

  • The Problem: If one chef gets a huge order of hard-to-chop vegetables, they fall behind, slowing down the whole kitchen. Also, if you have 1,000 chefs, you need 1,000 assistants all running to the same market, buying the same onions, and carrying them back. It's a traffic jam, and your kitchen is drowning in paperwork (memory usage).

MegaScale-Data is the new, revolutionary kitchen management system that fixes this. Here is how it works, using simple analogies:

1. The "Specialized Assembly Line" (Disaggregation)

Instead of every chef having their own assistant, MegaScale-Data splits the work into two specialized teams:

  • The Scavengers (Source Loaders): These are dedicated runners who only go to specific markets. One runner only goes to the fish market, another only to the spice market. They grab the raw ingredients and bring them to a central station. They don't worry about chopping or cooking; they just fetch. This stops the chaos of 1,000 people running to the same store.
  • The Prep Chefs (Data Constructors): These chefs sit at a central station. They take the raw ingredients from the Scavengers, chop them, mix them into bowls (batches), and hand them to the main Chefs (GPUs).
  • The Benefit: The Scavengers don't need to know about the cooking; the Prep Chefs don't need to run to the market. This eliminates the "traffic jam" and saves a massive amount of space (memory) because you aren't duplicating paperwork for every single chef.
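In code, the assembly line might look like the tiny sketch below: loader threads fetch raw items from their own source into a shared queue, and a constructor assembles them into batches. All names here (`source_loader`, `data_constructor`, `run_pipeline`) are illustrative, not the paper's actual API.

```python
import queue
import threading

def source_loader(source, raw_q):
    """Scavenger: fetch raw items from one source and push them to a shared queue."""
    for item in source:
        raw_q.put(item)
    raw_q.put(None)  # sentinel: this source is exhausted

def data_constructor(raw_q, batch_q, n_sources, batch_size):
    """Prep chef: assemble raw items from all sources into fixed-size batches."""
    done, batch = 0, []
    while done < n_sources:
        item = raw_q.get()
        if item is None:
            done += 1
            continue
        batch.append(item)
        if len(batch) == batch_size:
            batch_q.put(batch)
            batch = []
    if batch:
        batch_q.put(batch)  # flush the final partial batch
    batch_q.put(None)       # tell the consumer we are finished

def run_pipeline(sources, batch_size=4):
    raw_q, batch_q = queue.Queue(), queue.Queue()
    loaders = [threading.Thread(target=source_loader, args=(s, raw_q))
               for s in sources]
    ctor = threading.Thread(target=data_constructor,
                            args=(raw_q, batch_q, len(sources), batch_size))
    for t in loaders + [ctor]:
        t.start()
    batches = []
    while (b := batch_q.get()) is not None:
        batches.append(b)
    for t in loaders + [ctor]:
        t.join()
    return batches
```

Because fetching and batching are separate roles, you can scale the number of loader threads per source independently of the constructors, and no per-GPU worker holds its own copy of every source's state.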

2. The "Smart Traffic Controller" (Centralized Orchestration)

In the old kitchen, if a chef needed a 10-minute steak and another needed a 2-minute salad, the system didn't care. The chef with the steak would finish last, making everyone else wait (this is called a straggler).

MegaScale-Data has a Smart Traffic Controller (The Planner).

  • It looks at the whole kitchen and says, "Chef A, you have a heavy task. Let's give you a few quick salads to balance it out. Chef B, you have a light task; let's give you a big steak."
  • It mixes the ingredients before they reach the chefs to ensure everyone is equally busy. This prevents anyone from sitting idle while waiting for the slowest person to finish.
  • The Result: The whole kitchen cooks much faster because no one is ever waiting around.
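One simple way a planner can mix steaks and salads is greedy longest-processing-time (LPT) scheduling: assign the heaviest samples first, each to the currently least-loaded worker. This is a minimal sketch assuming each sample has a known cost (e.g., sequence length); the paper's actual Planner policy may differ.

```python
import heapq

def balance(samples, n_workers):
    """Assign (cost, sample) pairs to workers so the max total cost per
    worker is (approximately) minimized: heaviest first, each to the
    least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for cost, sample in sorted(samples, reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append((cost, sample))
        heapq.heappush(heap, (load + cost, w))
    return assignment
```

With samples costing 10, 9, 2, 2, and 1 spread over two workers, LPT lands at loads of 12 and 12, whereas a naive split could leave one worker with the 10-minute steak plus extras while the other sits idle.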

3. The "Dynamic Staffing" (Auto-Scaling)

Imagine the recipe changes every hour. Sometimes you need 90% fish and 10% spices; next hour, it's 50/50.

  • Old Way: You hire a fixed number of assistants. If you suddenly need 10x more fish, your assistants can't keep up, and the chefs stop cooking.
  • MegaScale-Data: The system is like a smart HR manager. It sees the demand for fish is spiking, so it instantly hires more "Fish Scavengers." If the spice demand drops, it sends those assistants home. It scales the team up and down automatically based on what the recipe needs right now.
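The "smart HR manager" logic can be sketched as a sizing rule: given the current mixing ratios and each source's per-worker throughput, compute how many loaders each source needs to keep up with the trainers' consumption rate. The function name and numbers below are purely illustrative.

```python
import math

def plan_workers(mix, throughput, target_rate):
    """Size each source's loader pool.

    mix:         {source: fraction of each batch drawn from that source}
    throughput:  {source: samples/sec one loader of that source can fetch}
    target_rate: total samples/sec the trainers consume
    """
    return {src: max(1, math.ceil(frac * target_rate / throughput[src]))
            for src, frac in mix.items()}
```

If the recipe is 50% text and 50% video at 1,000 samples/sec, and a video loader is 10x slower than a text loader, the plan hires 5 text loaders but 50 video loaders; when the mix shifts, rerunning the rule rescales both pools.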

4. The "Backup Plan" (Fault Tolerance)

In a giant kitchen, sometimes a runner trips, or a fridge breaks.

  • Old Way: The whole kitchen stops until you fix the runner.
  • MegaScale-Data: It has Shadow Runners standing by. If a runner drops a tray, a Shadow Runner instantly picks up the exact same tray and continues the job without the chefs even noticing a pause. It also keeps "checkpoints" (like saving your progress in a video game) so you never lose your place.
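The checkpointing idea reduces to persisting each source's read position so a restarted (or shadow) loader resumes exactly where the failed one left off. This simplified sketch uses a JSON file and an atomic rename; the real system checkpoints far richer state.

```python
import json
import os

def save_checkpoint(path, offsets):
    """Persist per-source read offsets, e.g. {"text": 3, "video": 17}."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, path)  # atomic swap: a crash never leaves a torn file

def resume(path, sources):
    """Yield (source, item) pairs, skipping items already consumed."""
    with open(path) as f:
        offsets = json.load(f)
    for name, data in sources.items():
        for i in range(offsets.get(name, 0), len(data)):
            yield name, data[i]
```

Because the offset file is swapped atomically, a replacement runner reading it always sees a consistent "save point," never a half-written one.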

The Big Win

By reorganizing the kitchen this way, the authors achieved two massive results:

  1. Speed: Training ran up to 4.5 times faster. The chefs (GPUs) never stopped moving.
  2. Efficiency: They used up to 13.5 times less CPU memory. They stopped buying 1,000 copies of the same map and recipe book, saving huge amounts of space.

In short: MegaScale-Data stops treating data loading like a chaotic, individual struggle and turns it into a synchronized, smart, and scalable assembly line, allowing AI models to learn faster and cheaper.
