MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

MegaScale-Data is an industrial-grade distributed data-loading architecture for multisource large foundation model training. It tackles workload imbalance and memory redundancy through disaggregated preprocessing, centralized orchestration, and auto-partitioning, achieving up to a 4.5x throughput improvement and a 13.5x reduction in CPU memory usage.

Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, Junda Feng, Jianyu Jiang, Yangrui Chen, Shuaishuai Cao, Jialing He, Kaihua Jiang, Yuanzhe Hu, Shibiao Nong, Yanghua Peng, Haibin Lin, Chuan Wu

Published 2026-03-17

Imagine you are running a massive, high-stakes cooking competition to create the world's most delicious dish (the Large Foundation Model). You have thousands of chefs (GPUs) working in a giant kitchen, and they need a constant stream of ingredients (data) to keep cooking.

The problem? Your ingredients come from hundreds of different farms, fisheries, and spice markets (Multisource Data). Some are fresh vegetables (text), some are frozen fish (images), and some are exotic spices (video).

In the old way of doing things (the Old Dataloader), every single chef had their own personal assistant running around the market, buying ingredients, washing them, chopping them, and bringing them back.

  • The Problem: If one chef gets a huge order of hard-to-chop vegetables, they fall behind, slowing down the whole kitchen. Also, if you have 1,000 chefs, you need 1,000 assistants all running to the same market, buying the same onions, and carrying them back. It's a traffic jam, and your kitchen is drowning in paperwork (memory usage).

MegaScale-Data is the new, revolutionary kitchen management system that fixes this. Here is how it works, using simple analogies:

1. The "Specialized Assembly Line" (Disaggregation)

Instead of every chef having their own assistant, MegaScale-Data splits the work into two specialized teams:

  • The Scavengers (Source Loaders): These are dedicated runners who only go to specific markets. One runner only goes to the fish market, another only to the spice market. They grab the raw ingredients and bring them to a central station. They don't worry about chopping or cooking; they just fetch. This stops the chaos of 1,000 people running to the same store.
  • The Prep Chefs (Data Constructors): These chefs sit at a central station. They take the raw ingredients from the Scavengers, chop them, mix them into bowls (batches), and hand them to the main Chefs (GPUs).
  • The Benefit: The Scavengers don't need to know about the cooking; the Prep Chefs don't need to run to the market. This eliminates the "traffic jam" and saves a massive amount of space (memory) because you aren't duplicating paperwork for every single chef.
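In code, the assembly line might look like the tiny sketch below: loader threads fetch raw items from their own source into a shared queue, and a constructor assembles them into batches. All names here (`source_loader`, `data_constructor`, `run_pipeline`) are illustrative, not the paper's actual API.

```python
import queue
import threading

def source_loader(source, raw_q):
    """Scavenger: fetch raw items from one source and push them to a shared queue."""
    for item in source:
        raw_q.put(item)
    raw_q.put(None)  # sentinel: this source is exhausted

def data_constructor(raw_q, batch_q, n_sources, batch_size):
    """Prep chef: assemble raw items from all sources into fixed-size batches."""
    done, batch = 0, []
    while done < n_sources:
        item = raw_q.get()
        if item is None:
            done += 1
            continue
        batch.append(item)
        if len(batch) == batch_size:
            batch_q.put(batch)
            batch = []
    if batch:
        batch_q.put(batch)  # flush the final partial batch
    batch_q.put(None)       # tell the consumer we are finished

def run_pipeline(sources, batch_size=4):
    raw_q, batch_q = queue.Queue(), queue.Queue()
    loaders = [threading.Thread(target=source_loader, args=(s, raw_q))
               for s in sources]
    ctor = threading.Thread(target=data_constructor,
                            args=(raw_q, batch_q, len(sources), batch_size))
    for t in loaders + [ctor]:
        t.start()
    batches = []
    while (b := batch_q.get()) is not None:
        batches.append(b)
    for t in loaders + [ctor]:
        t.join()
    return batches
```

Because fetching and batching are separate roles, you can scale the number of loader threads per source independently of the constructors, and no per-GPU worker holds its own copy of every source's state.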

2. The "Smart Traffic Controller" (Centralized Orchestration)

In the old kitchen, if a chef needed a 10-minute steak and another needed a 2-minute salad, the system didn't care. The chef with the steak would finish last, making everyone else wait (this is called a straggler).

MegaScale-Data has a Smart Traffic Controller (The Planner).

  • It looks at the whole kitchen and says, "Chef A, you have a heavy task. Let's give you a few quick salads to balance it out. Chef B, you have a light task; let's give you a big steak."
  • It mixes the ingredients before they reach the chefs to ensure everyone is equally busy. This prevents anyone from sitting idle while waiting for the slowest person to finish.
  • The Result: The whole kitchen cooks much faster because no one is ever waiting around.
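One simple way a planner can mix steaks and salads is greedy longest-processing-time (LPT) scheduling: assign the heaviest samples first, each to the currently least-loaded worker. This is a minimal sketch assuming each sample has a known cost (e.g., sequence length); the paper's actual Planner policy may differ.

```python
import heapq

def balance(samples, n_workers):
    """Assign (cost, sample) pairs to workers so the max total cost per
    worker is (approximately) minimized: heaviest first, each to the
    least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for cost, sample in sorted(samples, reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append((cost, sample))
        heapq.heappush(heap, (load + cost, w))
    return assignment
```

With samples costing 10, 9, 2, 2, and 1 spread over two workers, LPT lands at loads of 12 and 12, whereas a naive split could leave one worker with the 10-minute steak plus extras while the other sits idle.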

3. The "Dynamic Staffing" (Auto-Scaling)

Imagine the recipe changes every hour. Sometimes you need 90% fish and 10% spices; next hour, it's 50/50.

  • Old Way: You hire a fixed number of assistants. If you suddenly need 10x more fish, your assistants can't keep up, and the chefs stop cooking.
  • MegaScale-Data: The system is like a smart HR manager. It sees the demand for fish is spiking, so it instantly hires more "Fish Scavengers." If the spice demand drops, it sends those assistants home. It scales the team up and down automatically based on what the recipe needs right now.
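The "smart HR manager" logic can be sketched as a sizing rule: given the current mixing ratios and each source's per-worker throughput, compute how many loaders each source needs to keep up with the trainers' consumption rate. The function name and numbers below are purely illustrative.

```python
import math

def plan_workers(mix, throughput, target_rate):
    """Size each source's loader pool.

    mix:         {source: fraction of each batch drawn from that source}
    throughput:  {source: samples/sec one loader of that source can fetch}
    target_rate: total samples/sec the trainers consume
    """
    return {src: max(1, math.ceil(frac * target_rate / throughput[src]))
            for src, frac in mix.items()}
```

If the recipe is 50% text and 50% video at 1,000 samples/sec, and a video loader is 10x slower than a text loader, the plan hires 5 text loaders but 50 video loaders; when the mix shifts, rerunning the rule rescales both pools.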

4. The "Backup Plan" (Fault Tolerance)

In a giant kitchen, sometimes a runner trips, or a fridge breaks.

  • Old Way: The whole kitchen stops until you fix the runner.
  • MegaScale-Data: It has Shadow Runners standing by. If a runner drops a tray, a Shadow Runner instantly picks up the exact same tray and continues the job without the chefs even noticing a pause. It also keeps "checkpoints" (like saving your progress in a video game) so you never lose your place.
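The checkpointing idea reduces to persisting each source's read position so a restarted (or shadow) loader resumes exactly where the failed one left off. This simplified sketch uses a JSON file and an atomic rename; the real system checkpoints far richer state.

```python
import json
import os

def save_checkpoint(path, offsets):
    """Persist per-source read offsets, e.g. {"text": 3, "video": 17}."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, path)  # atomic swap: a crash never leaves a torn file

def resume(path, sources):
    """Yield (source, item) pairs, skipping items already consumed."""
    with open(path) as f:
        offsets = json.load(f)
    for name, data in sources.items():
        for i in range(offsets.get(name, 0), len(data)):
            yield name, data[i]
```

Because the offset file is swapped atomically, a replacement runner reading it always sees a consistent "save point," never a half-written one.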

The Big Win

By reorganizing the kitchen this way, the authors achieved two massive results:

  1. Speed: Training ran up to 4.5 times faster. The chefs (GPUs) never stopped moving.
  2. Efficiency: They used up to 13.5 times less CPU memory. They stopped buying 1,000 copies of the same map and recipe book, saving huge amounts of space.

In short: MegaScale-Data stops treating data loading like a chaotic, individual struggle and turns it into a synchronized, smart, and scalable assembly line, allowing AI models to learn faster and cheaper.
