Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters

This paper introduces Hadar, a task-level heterogeneity-aware scheduler that optimizes resource utilization in deep learning clusters, and its enhanced version HadarE, which employs job forking to achieve significant reductions in training time and improved model inference quality compared to state-of-the-art alternatives.

Abeda Sultana, Nabin Pakka, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng

Published 2026-03-17

Imagine you own a massive, high-tech kitchen (a Deep Learning Cluster) where you need to cook hundreds of different complex dishes (Deep Learning Models) at the same time.

Your kitchen is equipped with a mix of tools: some are super-fast, high-end blenders (V100 GPUs), some are standard blenders (P100 GPUs), and some are older, slower blenders (K80 GPUs).

The problem is that your current kitchen manager (existing schedulers like Gavel) is a bit rigid. They treat every dish as a single, indivisible block. If a recipe needs "4 high-speed blenders" but you only have 3 available, the manager says, "Sorry, we can't start this dish until we have all 4." Meanwhile, the 3 available blenders sit idle, and the kitchen is under-utilized.

This paper introduces a new, smarter kitchen manager named Hadar, and an even more advanced version called HadarE. Here is how they work, explained simply:

1. The Problem: The "All-or-Nothing" Bottleneck

Traditional managers look at a job (a dish) and say, "This needs 4 specific tools." If the kitchen doesn't have exactly those 4 tools free at that exact moment, the job waits.

  • The Analogy: Imagine trying to fit a large sofa into an elevator. If the elevator is too small, you wait. You don't try to disassemble the sofa and send the cushions up one by one, even though that would get the job done faster.
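To make the all-or-nothing rule concrete, here is a minimal sketch of how such a placement decision might look. The function name `gang_schedule` and the numbers are hypothetical illustrations, not code from Gavel or from the paper.

```python
def gang_schedule(free_gpus, requested):
    """All-or-nothing placement: grant the full request or nothing at all."""
    if free_gpus >= requested:
        return requested
    return 0


free = 3                                   # blenders currently available
granted = gang_schedule(free, requested=4)  # the recipe wants 4
idle = free - granted                       # blenders left unused this round
print(granted, idle)                        # prints: 0 3
```

The job gets nothing, and all three free devices stay idle until a fourth one frees up.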

2. The First Solution: Hadar (The Smart Organizer)

Hadar changes the rules. Instead of looking at the whole dish, it looks at the ingredients (the individual tasks).

  • How it works: Hadar knows that a specific part of a recipe might run well on a high-speed blender, while another part runs fine on a standard one. It breaks the job down into tiny pieces.
  • The Magic: If you have 3 high-speed blenders and 2 standard ones, Hadar can say, "Let's put the heavy chopping on the high-speed ones and the light mixing on the standard ones."
  • The Result: Far fewer tools sit idle, because a job can start on whatever mix of blenders is free. The paper shows this makes the kitchen 20% more efficient and finishes cooking 20% faster than the old manager.
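The idea of placing individual tasks on whatever mix of devices is free can be sketched with a toy greedy rule. This is only an illustration under stated assumptions: the `Job` class, the `task_level_schedule` function, and the throughput numbers are all hypothetical, and the greedy rule stands in for whatever optimization Hadar actually performs.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    tasks: int         # number of GPU tasks (workers) the job wants
    throughput: dict   # hypothetical relative speed per GPU type


def task_level_schedule(jobs, free):
    """Greedily place each task on the fastest GPU type that is still free.

    Returns {job name: {gpu type: tasks placed}}. `free` is not mutated.
    """
    free = dict(free)  # work on a copy
    placement = {}
    for job in jobs:
        placed = {}
        # Try GPU types from fastest to slowest for this particular job.
        for gpu in sorted(job.throughput, key=job.throughput.get, reverse=True):
            remaining = job.tasks - sum(placed.values())
            take = min(free.get(gpu, 0), remaining)
            if take > 0:
                placed[gpu] = take
                free[gpu] -= take
        placement[job.name] = placed
    return placement


jobs = [Job("resnet", 4, {"V100": 3.0, "P100": 2.0, "K80": 1.0})]
free = {"V100": 3, "P100": 2, "K80": 0}
plan = task_level_schedule(jobs, free)
print(plan)  # {'resnet': {'V100': 3, 'P100': 1}}
```

With 3 fast and 2 standard blenders free, the 4-task job starts immediately on a 3 + 1 mix instead of waiting for a fourth fast blender.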

3. The Second Solution: HadarE (The "Clone Army")

Even with Hadar, there's a limit. If you have 5 chefs (nodes) but only 3 dishes to cook, 2 chefs will still be standing around doing nothing.

  • The Innovation: HadarE introduces a "Magic Photocopier." It takes a single dish and forks it into multiple copies.
  • The Analogy: Imagine you have one giant pizza to bake. Instead of putting it in one oven, you slice it into 5 pieces and put each slice in a different oven (even if the ovens are different sizes!).
  • How it works:
    1. Forking: The system splits one training job into 5 copies.
    2. Parallel Cooking: All 5 copies cook simultaneously on 5 different machines.
    3. The Merge: Once the slices are done, a "Head Chef" (Job Tracker) takes the results, mixes them back together (averaging the weights), and creates one perfect, finished pizza.
  • The Result: Even if you only have 1 job left, you can use all 5 ovens. This boosts efficiency by 45% and cuts the total cooking time in half (or even by 80% in some cases).
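The merge step described above (averaging the weights) can be sketched in a few lines. This is a simplified illustration, assuming each forked copy reports its weights as a flat list of numbers and that all copies count equally; the function name `average_weights` and the sample values are hypothetical, and the paper's Job Tracker may combine results differently.

```python
def average_weights(copies):
    """Element-wise mean of the weight vectors returned by the forked copies."""
    n = len(copies)
    return [sum(ws) / n for ws in zip(*copies)]


# Hypothetical weights reported by three forked copies of one training job.
copies = [
    [1.0, 2.0, -1.0],
    [2.0, 2.0, -3.0],
    [3.0, 2.0, -2.0],
]
merged = average_weights(copies)
print(merged)  # [2.0, 2.0, -2.0]
```

In a real system the merged weights would be sent back to every copy before the next round of training, so that all copies continue from the same model.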

4. Why It's Better Than Just Speed

You might think, "If I split a job and run it on weird, mixed machines, the result might be messy."

  • The Surprise: The paper found that because HadarE uses the best machines for the heaviest parts of the work, the final "pizza" (the AI model) actually tastes better. The models trained with HadarE had higher accuracy and better quality than those trained with the old rigid methods.

Summary of the "Magic"

  • Old Way (Gavel): "I need 4 specific tools. If I don't have them, I wait." (Wasted time, wasted tools).
  • Hadar: "I can use any tool for any part of the job. Let's keep every blender busy." (Better efficiency).
  • HadarE: "Let's clone the job so we can use every single tool in the kitchen at once, then glue the results back together." (Maximum speed, maximum efficiency, and surprisingly high quality).

In short, this paper teaches us how to stop waiting for perfect conditions and start using every available resource, no matter how mismatched, to get the job done faster and better.
