Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters

Imagine you own a massive, high-tech kitchen (a Deep Learning Cluster) where you need to cook hundreds of different complex dishes (Deep Learning Models) at the same time.

Your kitchen is equipped with a mix of tools: some are super-fast, high-end blenders (V100 GPUs), some are standard blenders (P100 GPUs), and some are older, slower blenders (K80 GPUs).

The problem is that your current kitchen manager (existing schedulers like Gavel) is a bit rigid. They treat every dish as a single, indivisible block. If a recipe needs "4 high-speed blenders" but you only have 3 available, the manager says, "Sorry, we can't start this dish until we have all 4." Meanwhile, the 3 available blenders sit idle, and the kitchen is under-utilized.

This paper introduces a new, smarter kitchen manager named Hadar, and an even more advanced version called HadarE. Here is how they work, explained simply:

1. The Problem: The "All-or-Nothing" Bottleneck

Traditional managers look at a job (a dish) and say, "This needs 4 specific tools." If the kitchen doesn't have exactly those 4 tools free at that exact moment, the job waits.

The Analogy: Imagine trying to fit a large sofa into an elevator. If the elevator is too small, you wait. You don't try to disassemble the sofa and send the cushions up one by one, even though that would get the job done faster.

2. The First Solution: Hadar (The Smart Organizer)

Hadar changes the rules. Instead of looking at the whole dish, it looks at the ingredients (the individual tasks).

How it works: Hadar knows that a specific part of a recipe might run well on a high-speed blender, while another part runs fine on a standard one. It breaks the job down into tiny pieces.
The Magic: If you have 3 high-speed blenders and 2 standard ones, Hadar can say, "Let's put the heavy chopping on the high-speed ones and the light mixing on the standard ones."
The Result: Far fewer tools sit idle. The paper reports that kitchen utilization rises to about 1.2 times that of the old manager, and total cooking time shrinks by a factor of about 1.2.

3. The Second Solution: HadarE (The "Clone Army")

Even with Hadar, there's a limit. If you have 5 chefs (nodes) but only 3 dishes to cook, 2 chefs will still be standing around doing nothing.

The Innovation: HadarE introduces a "Magic Photocopier." It takes a single dish and forks it into multiple copies.
The Analogy: Imagine you have one giant pizza to bake. Instead of putting it on one oven, you slice it into 5 pieces and put each slice in a different oven (even if the ovens are different sizes!).
How it works:
1. Forking: The system splits one training job into 5 copies.
2. Parallel Cooking: All 5 copies cook simultaneously on 5 different machines.
3. The Merge: Once the slices are done, a "Head Chef" (Job Tracker) takes the results, mixes them back together (averaging the weights), and creates one perfect, finished pizza.
The Result: Even if you have only 1 job left, you can use all 5 ovens. Compared to the old manager, this boosts utilization by 45% to 62% and cuts the total cooking time by 50% on one test cluster and by 80% on another.

4. Why It's Better Than Just Speed

You might think, "If I split a job and run it on weird, mixed machines, the result might be messy."

The Surprise: The paper found that because HadarE lets the most powerful machines complete more training steps before the results are merged, the final "pizza" (the AI model) actually tastes better. The models trained with HadarE had higher accuracy than the same models trained with Hadar alone.

Summary of the "Magic"

Old Way (Gavel): "I need 4 specific tools. If I don't have them, I wait." (Wasted time, wasted tools).
Hadar: "I can use any tool for any part of the job. Let's fill every seat." (Better efficiency).
HadarE: "Let's clone the job so we can use every single tool in the kitchen at once, then glue the results back together." (Maximum speed, maximum efficiency, and surprisingly high quality).

In short, this paper teaches us how to stop waiting for perfect conditions and start using every available resource, no matter how mismatched, to get the job done faster and better.

1. Problem Statement

Deep Learning (DL) training on modern clusters faces two critical inefficiencies:

Lack of Fine-Grained Heterogeneity Awareness: Existing schedulers (e.g., Gavel) often treat jobs at a coarse "job level," assuming that a job must run on a single accelerator type or a homogeneous set of devices. If a job requires 4 GPUs but the cluster has only 3 V100s and 3 K80s free, the job waits while those GPUs may sit idle. How much this matters depends on the model, because training throughput varies widely across hardware types (e.g., a ResNet-50 may run 10x faster on a V100 than on a K80, while other models show less variance).
Sub-optimal Resource Utilization: Traditional schedulers often assign a single job to a single node. If a job finishes early or requires fewer resources than a node provides, the remaining resources on that node (or other idle nodes) go unused, even if the job could theoretically be split across multiple nodes.
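The 4-GPU example above can be made concrete with a toy comparison. This is only an illustrative sketch: the function names are hypothetical, and the greedy fastest-first fill stands in for the idea of mixing GPU types. It is not the allocation algorithm used by Gavel or Hadar.

```python
def homogeneous_allocation(free, demand):
    """Job-level view: succeed only if one GPU type alone covers the demand."""
    for gpu_type, count in free.items():
        if count >= demand:
            return {gpu_type: demand}
    return None  # the job waits


def mixed_allocation(free, demand, speed_order):
    """Task-level view: fill the demand from the fastest type to the slowest."""
    alloc = {}
    remaining = demand
    for gpu_type in speed_order:
        take = min(free.get(gpu_type, 0), remaining)
        if take > 0:
            alloc[gpu_type] = take
            remaining -= take
        if remaining == 0:
            return alloc
    return None  # not enough free GPUs of any type


free_gpus = {"V100": 3, "K80": 3}
print(homogeneous_allocation(free_gpus, 4))                    # None
print(mixed_allocation(free_gpus, 4, ["V100", "P100", "K80"]))  # {'V100': 3, 'K80': 1}
```

With the same free GPUs, the homogeneous rule leaves the job waiting, while the mixed rule starts it on 3 V100s and 1 K80.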

2. Methodology

The authors propose a two-stage solution: Hadar (a task-level scheduler) and HadarE (an enhanced version utilizing job forking).

A. Hadar: Task-Level Heterogeneity-Aware Scheduler

Core Concept: Instead of scheduling entire jobs, Hadar schedules individual tasks (iterations/epochs) across both spatial (different machines/GPUs) and temporal (time slots) dimensions. It allows a single job to utilize a mix of heterogeneous accelerators (e.g., V100s, P100s, K80s) simultaneously.
Optimization Framework:
- Scheduling is formulated as an optimization problem that maximizes overall job utility (the inverse of completion time) subject to resource constraints.
- Primal-Dual Approach: The authors reformulate the problem into an Integer Linear Program (ILP) and solve it using a Primal-Dual framework.
- Dual Subroutine: A dual subroutine calculates a "price" for resources based on an exponential price function. This price increases as resources are allocated, filtering out low-utility jobs and ensuring high-utility jobs get resources.
- Algorithm: A dynamic programming (DP) algorithm is used to find the optimal allocation ($s^*$) that minimizes the cost (resource price $\times$ allocation) while maximizing job payoff.
Theoretical Guarantees: The algorithm is proven to have polynomial runtime complexity and a competitive ratio of $2\alpha$ (where $\alpha$ depends on the ratio of max/min job utilities), so its solutions are guaranteed to be within a factor of $2\alpha$ of the optimal.
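To illustrate how an exponential price can filter out low-utility jobs as resources fill up, here is a minimal sketch. The exact price function from the paper is not reproduced in this summary, so the form below (interpolating exponentially from `u_min` at zero load to `u_max` at full load) is an assumption borrowed from common online primal-dual schedulers. All names and numbers are hypothetical.

```python
def resource_price(used, capacity, u_min, u_max):
    """Assumed exponential price: u_min when idle, u_max when fully allocated."""
    if capacity <= 0 or u_min <= 0 or u_max < u_min:
        raise ValueError("need capacity > 0 and 0 < u_min <= u_max")
    load = min(max(used / capacity, 0.0), 1.0)
    return u_min * (u_max / u_min) ** load


def payoff(job_utility, demand, used, capacity, u_min, u_max):
    """Job utility minus the cost (resource price times allocation)."""
    return job_utility - demand * resource_price(used, capacity, u_min, u_max)


# A high-utility job (10) and a low-utility job (3), each asking for 2 GPUs
# on a 4-GPU resource pool at different load levels.
for used in (0, 2, 4):
    price = resource_price(used, capacity=4, u_min=1.0, u_max=16.0)
    high_ok = payoff(10.0, 2, used, 4, 1.0, 16.0) > 0
    low_ok = payoff(3.0, 2, used, 4, 1.0, 16.0) > 0
    print(used, price, high_ok, low_ok)
```

In this toy setting the price is 1 on an idle pool, 4 at half load, and 16 at full load. The low-utility job is admitted only while the pool is idle, whereas the high-utility job still has a positive payoff at half load.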

B. HadarE: Resource Utilization Enhancement via Job Forking

Core Concept: To address the limitation where a job is restricted to a single node, HadarE forks every training job into $n$ copies (where $n$ is the number of cluster nodes).
Mechanism:
- Parallel Execution: Multiple copies of the same job run concurrently on different heterogeneous nodes.
- Job Tracker: A central component tracks the progress of all forked copies.
- Aggregation & Consolidation: At the end of each time slot, the Job Tracker aggregates the training steps completed by all copies and consolidates model parameters via weight averaging. This ensures the final model quality remains consistent with standard training.
- Initial Throughput Estimation: Since profiling every job on every node is costly, HadarE uses a derived formula based on the Performance-Memory Index (PMI), batch size, PCIe scaling, and model complexity to estimate initial throughput, refining it as training progresses.
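The aggregation and consolidation step can be sketched as follows. This is a simplified illustration, not the Job Tracker's actual code: parameters are plain Python lists rather than tensors, the function name is hypothetical, and a uniform (unweighted) average is assumed because this summary does not say whether the averaging is weighted.

```python
def consolidate(copies):
    """Sum the steps of all forked copies and average their parameters.

    copies: list of (steps_completed, params) tuples, where params maps a
    parameter name to a list of floats with the same shape in every copy.
    """
    if not copies:
        raise ValueError("need at least one forked copy")
    total_steps = sum(steps for steps, _ in copies)
    averaged = {}
    for name in copies[0][1]:
        # Group the i-th value of this parameter across all copies.
        columns = zip(*(params[name] for _, params in copies))
        averaged[name] = [sum(col) / len(copies) for col in columns]
    return total_steps, averaged


# Two copies of one job after a time slot: a fast node and a slow node.
fast_copy = (100, {"w": [1.0, 2.0]})
slow_copy = (40, {"w": [3.0, 6.0]})
print(consolidate([fast_copy, slow_copy]))  # (140, {'w': [2.0, 4.0]})
```

The consolidated parameters would then seed every copy at the start of the next time slot, with the job credited for all 140 steps.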

3. Key Contributions

Hadar Scheduler: A novel scheduler that addresses performance heterogeneity at the task level (rather than job level), enabling fine-grained allocation across diverse GPU types.
Primal-Dual Optimization: Development of an optimization algorithm using a dual subroutine to solve the scheduling problem on heterogeneous nodes with proven polynomial complexity and bounded competitive ratios.
HadarE (Enhancement): A strategy to fork jobs into multiple copies for concurrent execution on separate nodes, maximizing cluster resource utilization (CRU) and reducing total training time.
Comprehensive Evaluation: Extensive validation using trace-driven simulations and real-world experiments on physical clusters (AWS and a lab testbed).

4. Experimental Results

The authors evaluated Hadar and HadarE against state-of-the-art schedulers (Gavel, Tiresias, YARN-CS) using real-world DL workloads (ResNet, Transformer, LSTM, etc.) on heterogeneous clusters.

Trace-Driven Simulation (vs. Gavel):
- Total Time Duration (TTD): Hadar shortened TTD by a factor of 1.20 compared to Gavel.
- Resource Utilization: Significant improvements in GPU Resource Utilization (GRU).
Physical Cluster Experiments (AWS & Lab Testbed):
- Resource Utilization (CRU):
  - Hadar improved CRU by 1.20× (AWS) and 1.21× (Lab) over Gavel.
  - HadarE improved CRU by 1.45× to 1.62× over Gavel by keeping all nodes busy via forking.
- Total Time Duration (TTD):
  - Hadar shortened TTD by a factor of 1.17 (AWS) and 1.16 (Lab) vs. Gavel.
  - HadarE reduced TTD by 50% (AWS) and 80% (Lab) compared to Gavel.
- Job Completion Time (JCT): HadarE shortened mean JCT by a factor of 2.23 to 2.76 compared to Gavel.
- Model Quality: Crucially, HadarE produced models with better inference quality (higher accuracy, lower MSE) than Hadar. For example, the Language Translation model achieved 54.69% accuracy with HadarE vs. 52.41% with Hadar. This is attributed to the ability to utilize the most powerful nodes for more training steps before consolidation.
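The results above mix two units: speedup factors (e.g., 1.20×) and percentage reductions (e.g., 50%). The small helper below converts between them so the figures can be compared directly. It assumes that "reduced by k×" means the baseline time divided by k, which is an interpretation of the summary rather than something it states explicitly.

```python
def reduction_from_speedup(factor):
    """Fraction of time saved when a run becomes `factor` times faster."""
    return 1.0 - 1.0 / factor


def speedup_from_reduction(fraction):
    """Speedup factor implied by cutting the time by `fraction` (0 <= fraction < 1)."""
    return 1.0 / (1.0 - fraction)


print(f"{reduction_from_speedup(1.20):.1%}")   # 16.7%
print(f"{speedup_from_reduction(0.50):.2f}x")  # 2.00x
print(f"{speedup_from_reduction(0.80):.2f}x")  # 5.00x
```

Under this reading, Hadar's 1.20× figure corresponds to roughly a 17% shorter total duration, while HadarE's 50% and 80% reductions correspond to 2× and 5× speedups over Gavel.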

5. Significance

This work bridges the gap between theoretical resource heterogeneity and practical cluster scheduling.

Efficiency: It demonstrates that moving from job-level to task-level scheduling, combined with job forking, can substantially reduce training time and resource idleness in heterogeneous environments.
Quality: It challenges the assumption that faster training (via parallelization) compromises model quality; in the reported experiments, the heterogeneous parallel training in HadarE yielded higher model quality than Hadar.
Scalability: The proposed algorithms are computationally efficient (polynomial time), making them viable for large-scale production clusters with thousands of jobs and diverse hardware.
Practicality: The use of initial throughput estimation formulas allows the system to operate effectively without extensive pre-profiling, making it deployable in dynamic cloud environments like AWS.
