Exploiting Dependency and Parallelism: Real-Time… — Plain-Language Explanation

Original authors: Yuanhai Zhang, Songyang He, Ruizhe Gou, Mingyue Cui, Boyang Li, Shuai Zhao, Kai Huang

Published 2026-02-25

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: The "Super-Chef" Problem

Imagine you have a Super-Chef (the GPU) in a massive, high-tech kitchen. This chef is incredibly fast and can cook hundreds of dishes at the exact same time using dozens of different stoves (called Streaming Multiprocessors or SMs).

In the world of Artificial Intelligence (AI), this chef is tasked with preparing complex meals (like running the software that lets a self-driving car perceive the road). These meals aren't just one big pot of soup; they are a series of steps (kernels) that must happen in a specific order. Some steps depend on others (you can't garnish the soup before you boil it), while others can happen simultaneously (chopping onions and boiling water).

The Problem:
The kitchen is chaotic.

Dependencies: The chef can't garnish the soup before it has boiled. Any step that depends on an earlier one has to wait for it, and while it waits, stoves may sit idle.
Resource Contention: If the chef tries to use 100 stoves at once but only has 80, the extra dishes get stuck in a line, causing unpredictable delays.
The "Black Box": The kitchen manager (the hardware scheduler) decides who cooks what and when, but it does so in a way that is hard to predict. Sometimes the chef finishes in 5 minutes; other times, it takes 10 minutes for the exact same meal.

For safety-critical systems (like a self-driving car), we need a guaranteed limit on how long the meal can take. If we guess wrong, the car might crash because it didn't finish calculating the route in time.

The Solution: The "Balanced Group" Strategy

The authors of this paper propose a new way to organize the chef's schedule. Instead of leaving the order to the kitchen manager's unpredictable decisions, they introduce a Master Planner (the software method) that organizes the cooking into Balanced Groups.

Here is how it works, step-by-step:

1. Breaking the Meal into "Courses" (Sub-graph Division)

Imagine the recipe is a giant flowchart. The planner looks at the chart and finds the "choke points"—steps where many different tasks must finish before the next one can start.

The Analogy: Think of a relay race. You can't start the next runner until the previous one crosses the line.
The Fix: The planner breaks the race into distinct "legs." Within each leg, the runners (kernels) are grouped so they can all run at the same time without tripping over each other.

2. Balancing the Load (Parallelism Scaling)

This is the most clever part.

The Old Way: If you have a huge pot of soup (a heavy task) and a tiny cup of tea (a light task), and you give them both the same number of stoves, the soup takes forever, and the tea finishes instantly. The chef sits around waiting for the soup.
The New Way: The planner looks at the size of the task.
- The Soup gets many stoves (high parallelism) to cook it fast.
- The Tea gets fewer stoves (low parallelism) so it doesn't finish too early and waste resources.
The Result: The soup and the tea finish at almost the exact same time. The chef never has to wait around. This is called Balancing.

3. The "Cut-and-Paste" Trick (Node Segmentation)

Sometimes, a single task is just too big to fit in the spare stoves available.

The Analogy: Imagine you have a giant pizza that needs to be cooked, but your oven only has space for half of it.
The Fix: The planner slices the pizza in half. It cooks the first half right now, and saves the second half to cook immediately after. This ensures the oven is never empty, but the order is strictly controlled.

4. Adding "Traffic Lights" (Extra Dependencies)

In a normal kitchen, the manager might let the pizza go before the salad is ready, causing chaos.

The Fix: The planner adds invisible "traffic lights" (extra dependencies) between the groups. It forces Group A to finish completely before Group B starts.
Why? This removes the randomness. Even if the hardware scheduler is chaotic, the "traffic lights" ensure the groups run in a strict, predictable line.

The Results: Why It Matters

The researchers tested this method on synthetic data and real-world benchmarks (numerical workloads such as Laplace, Gaussian Elimination, and Stencil computations).

Predictability: They could calculate a safe upper limit on how long the task can take, without guessing how the hardware scheduler will behave.
Speed: By balancing the load and keeping the stoves busy, they reduced the total time it took to finish the "meal" by up to 32.8% compared to standard methods.
No New Hardware: The best part? They didn't need to buy a new kitchen or build a new oven. They just changed the recipe book (the software code) using standard tools that already exist.

Summary Metaphor

Think of the GPU as a busy highway.

Old Method: Cars (tasks) enter the highway at random speeds. Big trucks (heavy tasks) clog the lanes, and small cars (light tasks) zip by, leaving empty lanes unused. Traffic jams happen unpredictably.
New Method: A traffic controller groups the cars into convoys.
- They slow down the fast cars and speed up the slow trucks so everyone in the convoy moves at the same pace.
- They break giant trucks into smaller trailers if the lane is too narrow.
- They enforce strict entry times so convoys don't crash into each other.

The Result: The highway flows more smoothly, unpredictable traffic jams are avoided, and we can put a firm limit on when the last car will arrive. That kind of guarantee is what safety-critical systems such as self-driving cars require.

1. Problem Statement

The rapid adoption of GPUs in safety-critical domains (e.g., autonomous driving, avionics) necessitates both high computational efficiency and strict timing predictability. However, current GPU scheduling and analysis face three primary challenges:

Unpredictable Execution Delays: While GPUs support massive parallelism, complex data dependencies and resource contention (e.g., Streaming Multiprocessor or SM saturation) within a Directed Acyclic Graph (DAG) of kernels often lead to unpredictable execution times.
Limitations of Existing Analysis: Most timing analysis methods assume sequential execution or single-kernel tasks. They fail to account for inter-kernel dependencies or the non-deterministic nature of hardware schedulers, leading to overly pessimistic worst-case bounds.
Ineffective Concurrency Management: Existing approaches (like standard CUDA streams) often rely on priority mechanisms that are unreliable on GPUs due to the "black-box" nature of hardware scheduling. Furthermore, naive parallelism assignment (e.g., Greedy approaches) often causes resource contention, negating the benefits of parallel execution.

Objective: To develop a scheduling and timing analysis framework for DAG-structured GPU tasks that minimizes the worst-case makespan while providing a safe, non-pessimistic, and predictable upper bound on execution time, without relying on hardware-specific priority assumptions.

2. Methodology

The proposed framework operates entirely within the standard CUDA API (using Streams, Events, and CUDA Graphs) and consists of three core phases:

A. Sub-graph Division (Balanced Grouping)

The DAG is decomposed into a sequence of disjoint, balanced groups ( $\Pi$ ) to manage dependencies and parallelism:

Block Formulation: The DAG is partitioned into blocks based on "join nodes" (nodes with multiple predecessors). These blocks are ordered by cumulative ancestor workload to ensure topological correctness.
Balanced Group Construction: Within each block, concurrent nodes are extracted and grouped into sets ( $\pi_j$ ) such that:
- The total number of required SMs does not exceed the GPU's capacity ( $M$ ).
- The execution times of nodes within a group are balanced to prevent one slow node from delaying the entire group.
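
The two ideas above, finding join nodes and packing concurrent nodes into groups that fit on the GPU, can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the function names, the example DAG, and the rule of sorting by execution time to keep a group balanced are all assumptions made for illustration. The sketch also assumes every node individually fits within the capacity.

```python
from collections import defaultdict

def join_nodes(edges):
    """Return the nodes that have more than one predecessor (the 'join nodes')."""
    preds = defaultdict(set)
    for src, dst in edges:
        preds[dst].add(src)
    return sorted(n for n, p in preds.items() if len(p) > 1)

def pack_groups(concurrent, capacity):
    """Greedily pack mutually independent nodes into groups that fit the GPU.

    `concurrent` maps node -> (exec_time, sms_required).
    Assumption of this sketch: visiting nodes in order of execution time puts
    nodes of similar duration next to each other, a simple stand-in for the
    balancing condition. Each node is assumed to need at most `capacity` SMs.
    """
    groups, current, used = [], [], 0
    ordered = sorted(concurrent.items(), key=lambda kv: kv[1][0], reverse=True)
    for node, (_, sms) in ordered:
        if current and used + sms > capacity:
            groups.append(current)   # close the group before it overflows
            current, used = [], 0
        current.append(node)
        used += sms
    if current:
        groups.append(current)
    return groups

# Hypothetical DAG: a fans out to b, c, d, which all join at e.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "e"), ("c", "e"), ("d", "e")]
print(join_nodes(edges))                      # ['e']

# Hypothetical (exec_time, sms_required) values for the concurrent nodes.
concurrent = {"b": (9.0, 6), "c": (8.0, 5), "d": (2.0, 4)}
print(pack_groups(concurrent, capacity=12))   # [['b', 'c'], ['d']]
```

In this example the two long-running nodes share a group that uses 11 of 12 SMs, and the short node is placed in a group of its own rather than overflowing the capacity.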

B. Scheduling Mechanism

The scheduler optimizes execution within these groups using the following techniques:

Parallelism Scaling: Instead of assigning maximum parallelism to every kernel, the method scales the parallelism ( $m_i$ ) of each node proportionally to its computation load relative to the group's total load. This ensures balanced finish times and prevents resource contention.
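- Formula: $m_i = \min\left(m_i^{max}, \text{round}\left(\frac{\hat{C}_i}{W(\pi_j)} \cdot M\right)\right)$, where $\hat{C}_i$ is the computation load of node $i$, $W(\pi_j)$ is the total load of its group, $M$ is the number of SMs, and $m_i^{max}$ is the node's maximum parallelism.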
Opportunistic Launching: If a balanced group does not fully utilize the GPU, the scheduler identifies "released" nodes from subsequent groups and launches them on the spare SMs to maximize utilization.
Node Segmentation: If a node selected for opportunistic launching is too large to fit in the remaining capacity, it is split into two segments:
- A parallel segment that executes immediately with the current group.
- A residual segment that is deferred to a later group.
Extra Dependency Injection: To guarantee predictability, the scheduler inserts artificial dependencies (edges) between groups. This forces a strict sequential execution order between balanced groups, eliminating the non-determinism of the hardware scheduler.
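
A minimal Python sketch of three of these techniques follows. It is an illustration under stated assumptions, not the authors' implementation: the function names and example numbers are hypothetical, the rounding mode (round half up) is assumed, and `split_node` divides only the SM demand because the way the workload is divided between segments is not described here. On a real GPU the extra edges would be expressed through the Streams, Events, and CUDA Graphs mentioned above, which this sketch does not model.

```python
def scale_parallelism(loads, m_max, capacity):
    """Apply m_i = min(m_i^max, round(C_i / W * M)) to every node of one group."""
    total = sum(loads.values())            # W(pi_j): total load of the group
    scaled = {}
    for node, load in loads.items():
        share = load * capacity / total    # this node's proportional share of M
        # Round half up; the rounding mode is an assumption of this sketch.
        scaled[node] = min(m_max[node], int(share + 0.5))
    return scaled

def split_node(sms_required, spare_sms):
    """Split a node's SM demand into (parallel, residual) parts.

    The parallel part runs now on the spare SMs; the residual part is
    deferred to a later group. Splitting by SM demand is an assumption.
    """
    parallel = min(sms_required, spare_sms)
    return parallel, sms_required - parallel

def chain_groups(groups):
    """Return extra edges that force group j to finish before group j+1 starts."""
    return [(u, v)
            for prev, nxt in zip(groups, groups[1:])
            for u in prev
            for v in nxt]

# Hypothetical group: three kernels with loads 60, 30, and 10 on a 20-SM GPU.
loads = {"k1": 60, "k2": 30, "k3": 10}
m_max = {"k1": 16, "k2": 16, "k3": 16}
print(scale_parallelism(loads, m_max, capacity=20))  # {'k1': 12, 'k2': 6, 'k3': 2}

print(split_node(sms_required=10, spare_sms=4))      # (4, 6)
print(chain_groups([["k1", "k2"], ["k3"]]))          # [('k1', 'k3'), ('k2', 'k3')]
```

Note that rounding each share independently can make the total exceed the capacity (two equal loads on 3 SMs would each round to 2), and a very small load can round to 0. How the paper handles these cases is not stated in this summary, so the sketch leaves them unresolved.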

C. Timing Analysis

Based on the enforced sequential order of groups, the paper derives a safe upper bound for the DAG makespan:

Lemma: The response time of a balanced group is bounded by the maximum execution time of any single node within that group (since resource contention is eliminated by scaling).
Theorem: The total worst-case makespan is the sum of the response times of all sequential groups.
Key Advantage: This analysis requires no assumptions about node-level priorities or hardware scheduling behavior, making it robust and non-pessimistic.
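
The lemma and theorem combine into a simple computation: take the longest node in each group, then add those values across the groups. A minimal Python sketch, with a hypothetical function name and made-up execution times in microseconds:

```python
def makespan_bound(groups):
    """Sum, over the sequential groups, of the longest node time in each group."""
    return sum(max(times) for times in groups)

# Hypothetical worst-case node execution times (microseconds), one list per group.
groups = [
    [420, 350, 400],   # group 1 is bounded by 420
    [210, 200],        # group 2 is bounded by 210
    [600],             # group 3 is bounded by 600
]
print(makespan_bound(groups))  # 1230
```

The closer the execution times within a group are to each other, the less idle capacity is hidden inside each group's maximum, which is why balancing the groups also helps keep the bound from being pessimistic.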

3. Key Contributions

Integrated Scheduling Scheme: A novel approach combining parallelism scaling, node segmentation, and extra dependency mechanisms specifically designed for DAG-structured GPU tasks to reduce makespan.
Priority-Free Timing Analysis: A theoretical framework that provides a safe, non-pessimistic worst-case makespan bound without assuming specific kernel priorities, addressing a gap in existing literature.
Implementation Feasibility: The method is implemented using standard CUDA APIs (Streams, Events, CUDA Graphs), requiring no modifications to the GPU hardware or driver software.
Performance Validation: Extensive experimental results demonstrating significant improvements over existing methods.

4. Experimental Results

The authors evaluated the method using both synthetic benchmarks (1,000+ DAGs) and real-world applications (Laplace, Gaussian Elimination, Stencil) on an NVIDIA RTX 3060 and a Jetson Orin Nano.

Worst-Case Makespan Reduction:
- Compared to the Graham_para method (a parallel adaptation of Graham's bound), the proposed method reduced the worst-case makespan by up to 32.8% (average 18.9–28.7% depending on SM count).
- Compared to the Greedy approach, improvements were up to 21.2% in specific configurations.
- The method showed increasing advantages as the number of SMs ( $M$ ) and DAG depth increased, where resource contention is most severe.
Measured Execution Time:
- In real-world benchmarks, the proposed method reduced the measured task execution time by up to 21.3% (average 15.0–21.3%).
- It also demonstrated higher stability, with lower standard deviations in execution times compared to the Greedy approach, due to the enforced execution order.

5. Significance

This paper addresses a critical bottleneck in real-time GPU computing: the trade-off between parallelism and predictability.

Safety-Critical Applicability: By providing a deterministic makespan bound without relying on hardware black-box behaviors, this work enables the deployment of complex AI/GPU workloads in safety-critical systems where timing guarantees are mandatory.
Resource Efficiency: The technique of "parallelism scaling" and "node segmentation" ensures that GPU resources are utilized efficiently without causing the unpredictable delays associated with resource contention.
Practicality: Since the solution relies solely on standard CUDA APIs, it can be immediately adopted by developers without waiting for new hardware generations or proprietary driver support.

In summary, the paper presents a framework that turns otherwise unpredictable GPU execution into a predictable and analyzable schedule, and in the reported experiments it outperforms the Graham_para and Greedy baselines in both worst-case makespan and measured execution time.

Exploiting Dependency and Parallelism: Real-Time Scheduling and Analysis for GPU Tasks