Evaluating Application Characteristics for GPU… — Plain-Language Explanation


Published 2026-01-27




Original authors: Mohammad Atif, Meghna Bhattacharya, Mark Dewing, Zhihua Dong, Julien Esseiva, Oliver Gutsche, Matti Kortelainen, Ka Hei Martin Kwok, Charles Leggett, Meifeng Lin, Aleksei Strelchenko, Vakhang Tsulaia, Brett Viren, Tianle Wang, Haiwang Yu

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a chef trying to cook a massive banquet. You have three different types of high-powered ovens in your kitchen: one made by NVIDIA, one by AMD, and one by Intel. Each oven cooks food differently, uses different knobs, and requires different recipes to work at its best.

If you write your recipe specifically for the NVIDIA oven (using a language called CUDA), you can't just put that same recipe into the AMD or Intel ovens. You'd have to rewrite the whole thing. This is a problem because you don't always know which oven you'll have in your kitchen tomorrow.

To solve this, the paper discusses "portability layers." Think of these as universal translators or smart adapters. They let you write one master recipe that the translator converts into the specific language each oven understands. The paper looks at several of these translators (like Kokkos, SYCL, OpenMP, and Alpaka) to see which one is the best fit for different kinds of cooking tasks.

Here is what the authors found when they tested these translators with real "recipes" from high-energy physics experiments (like those used to study subatomic particles):

1. The "Startup Time" Problem

Handing a task to a GPU (the oven) isn't instant. Every launch carries a small, fixed delay before the work actually starts.

The Issue: Some translators are slow to start the cooking process. For example, Kokkos can add a significant delay when using AMD ovens. If your cooking task is very short (like boiling an egg for 10 seconds) and the translator takes 5 seconds just to start the stove, a third of your total time is spent waiting.
The Lesson: If your tasks are tiny and quick, avoid translators that make the startup slow.
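
The arithmetic behind this lesson can be sketched in a few lines of Python. The function name `overhead_fraction` and all timings are illustrative assumptions, not measurements from the paper.

```python
def overhead_fraction(launch_overhead: float, work_time: float) -> float:
    """Fraction of the total time spent waiting for a task to start."""
    if launch_overhead < 0 or work_time <= 0:
        raise ValueError("launch_overhead must be >= 0 and work_time must be > 0")
    return launch_overhead / (launch_overhead + work_time)


# The egg example: 5 s to start the stove, 10 s of actual boiling.
egg = overhead_fraction(5.0, 10.0)      # one third of the total time

# The same fixed delay barely matters for a long task (10 minutes of roasting).
roast = overhead_fraction(5.0, 600.0)   # under one percent of the total time
```

The fixed delay is the same in both cases; only the length of the task decides whether it matters.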

2. The "Crowded Kitchen" Problem

In a real physics lab, the GPU isn't working alone. It's part of a larger system where many people (threads) are trying to use the oven at the same time.

The Issue: Some translators are bad at handling crowds. Kokkos, for instance, has a rule that says, "Only one person can talk to the oven at a time," which causes a traffic jam if multiple chefs try to launch tasks simultaneously. SYCL is a bit inconsistent; sometimes it lets everyone cook at once, and sometimes it forces them to wait in line, depending on which version of the translator you are using.
The Lesson: If your application needs many people working at once, you need a translator that knows how to manage a busy kitchen without locking the doors.
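
The traffic jam described above can be imitated with ordinary Python threads. This is a toy model under stated assumptions, not Kokkos or SYCL code: `launch_lock` is a hypothetical stand-in for whatever serialization a portability layer applies internally, and `peak_concurrency` simply reports how many chefs were at the oven at the same moment.

```python
import threading


def peak_concurrency(n_threads: int, serialize: bool) -> int:
    """Return the largest number of threads inside launch() at the same time."""
    launch_lock = threading.Lock()    # hypothetical layer-wide "one at a time" rule
    counter_lock = threading.Lock()   # protects the bookkeeping below
    in_flight = 0
    peak = 0
    # Without serialization, hold each thread inside launch() until all have
    # arrived, so the overlap is deterministic rather than a matter of timing.
    everyone_inside = threading.Barrier(n_threads)

    def launch() -> None:
        nonlocal in_flight, peak
        with counter_lock:
            in_flight += 1
            peak = max(peak, in_flight)
        if not serialize:
            everyone_inside.wait()
        with counter_lock:
            in_flight -= 1

    def chef() -> None:
        if serialize:
            with launch_lock:         # every chef queues for the single oven door
                launch()
        else:
            launch()

    threads = [threading.Thread(target=chef) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

With `serialize=True` the peak is always 1 no matter how many chefs there are; with `serialize=False` all of them are at the oven together.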

3. The "Toolbox Compatibility" Problem

Physics recipes often use special tools (libraries like ROOT or Eigen) that help with math and data.

The Issue: Some of these tools don't play nice with the translators. For example, a popular math library called Eigen often breaks when compiled with NVIDIA's compiler (nvcc), which many translators rely on. Also, using two different compilers (one for the CPU and one for the GPU) in the same project is like trying to build a house from two sets of blueprints that don't match: it makes the construction (building the software) a nightmare.
The Lesson: Before picking a translator, check if your favorite tools will fit inside it.

4. The "Furniture Arrangement" Problem

GPUs love simple, flat layouts. They prefer data to be arranged like a neat row of boxes. However, physics data often comes in complex, messy shapes (like a pile of different-sized suitcases).

The Issue: Translators try to fix this mess by wrapping the data in special containers. While this makes the code portable, it adds "overhead," like packing every single item into its own box before moving it, even if you only need to move one sock. This slows things down. Also, none of the translators are great at handling "jagged" data (rows of different lengths), which is very common in physics.
The Lesson: If your data is complex and messy, the translator might slow you down trying to tidy it up.
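
One common way to turn jagged rows into the "neat row of boxes" a GPU prefers is to store all values in one flat list and keep a second list of offsets marking where each row starts. The sketch below is a generic illustration of that idea, not a technique the paper prescribes; the function names and the sample numbers are made up.

```python
def flatten_jagged(rows):
    """Pack rows of different lengths into one flat list plus row offsets."""
    values = []
    offsets = [0]                      # offsets[i] is where row i starts
    for row in rows:
        values.extend(row)
        offsets.append(len(values))    # the end of this row is the start of the next
    return values, offsets


def get_row(values, offsets, i):
    """Recover row i from the flat layout."""
    return values[offsets[i]:offsets[i + 1]]


# Three events with 3, 0 and 2 hits (made-up numbers).
events = [[1.5, 2.0, 0.7], [], [4.2, 3.1]]
values, offsets = flatten_jagged(events)
# values  is [1.5, 2.0, 0.7, 4.2, 3.1]
# offsets is [0, 3, 3, 5]
```

The flat layout is easy to copy to a GPU in one piece, but someone has to write and maintain this packing step, which is part of the overhead described above.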

5. The "Specialized Tools" Problem

Sometimes you need a specific tool, like a Random Number Generator (RNG) or a Fast Fourier Transform (FFT).

The Issue: Each oven manufacturer has their own super-fast, specialized version of these tools. The universal translators often don't include these specialized versions, or they use their own slower versions. While you can force the translator to use the oven's native tool, it breaks the "portability" because that tool only works on that specific oven.
The Lesson: If you rely heavily on these specific tools, you might have to choose between speed (using the oven's native tool) or portability (using the translator's generic tool).
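
The fork between "native" and "generic" can be pictured as a lookup with a fallback. Everything in this sketch is hypothetical: `pretend_vendor_rng` merely stands in for a vendor's tuned generator, the registry contents are invented, and speed is not modeled at all. It only shows where the code path splits.

```python
import random


def portable_rng(seed: int, n: int) -> list[float]:
    """Generic generator that works on every oven (here: Python's own)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def pretend_vendor_rng(seed: int, n: int) -> list[float]:
    """Stand-in for a vendor's tuned generator; hypothetical, not a real API."""
    rng = random.Random(seed * 7919 + 1)
    return [rng.random() for _ in range(n)]


# Hypothetical registry: in this toy setup only one oven brand has a native tool.
NATIVE_RNGS = {"nvidia": pretend_vendor_rng}


def uniform_numbers(vendor: str, seed: int, n: int, prefer_native: bool = True):
    """Return (numbers, path_taken) for the requested oven brand."""
    native = NATIVE_RNGS.get(vendor) if prefer_native else None
    if native is not None:
        return native(seed, n), "native"
    return portable_rng(seed, n), "portable"
```

Calling `uniform_numbers("nvidia", 42, 3)` takes the native path, while `uniform_numbers("amd", 42, 3)` falls back to the portable one. Every entry added to the registry is one more piece of oven-specific code to maintain.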

6. The "Construction Time" and "Moving Day" Problems

Building the Recipe: Some translators make the "construction time" (compilation time) much longer. For huge projects, using certain translators can make the build process take hours instead of minutes.
Moving the Kitchen: If you build your software for a specific oven (e.g., an NVIDIA V100), it might not work on a newer one (an NVIDIA A100). Some translators require you to build a separate version for every single type of oven you might encounter. This creates a massive logistical headache for distributing the software to different labs.
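
For NVIDIA ovens, the V100/A100 example can be made concrete with compute capabilities (7.0 for the V100, 8.0 for the A100). The sketch below encodes a simplified version of CUDA's binary-compatibility rule: code compiled for capability X.y runs on X.z devices with z >= y, but not across major versions. It deliberately ignores the option of shipping intermediate code that is compiled on the target machine, and the function names are assumptions made for illustration.

```python
V100 = (7, 0)   # compute capability of the NVIDIA V100
A100 = (8, 0)   # compute capability of the NVIDIA A100


def binary_runs_on(built_for, device):
    """Simplified rule: same major version, device minor >= build minor."""
    return built_for[0] == device[0] and device[1] >= built_for[1]


def devices_without_a_build(shipped_builds, fleet):
    """Names of devices in the fleet that none of the shipped builds can run on."""
    return [
        name
        for name, capability in fleet.items()
        if not any(binary_runs_on(build, capability) for build in shipped_builds)
    ]


fleet = {"V100": V100, "A100": A100}
only_v100_build = devices_without_a_build([V100], fleet)        # ["A100"]
both_builds = devices_without_a_build([V100, A100], fleet)      # []
```

Covering the whole fleet means shipping one build per oven generation, which is the logistical headache described above.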

The Final Verdict

The paper concludes that there is no "perfect" translator.

Kokkos is great for some things but struggles with concurrency and startup times on certain hardware.
SYCL is powerful but can be inconsistent depending on the compiler version.
OpenMP and others have their own strengths and weaknesses regarding how they handle memory and different hardware.

The Takeaway: You can't just pick a translator because it's popular. You have to look at your specific "recipe" (your application). If your tasks are tiny and quick, pick a translator with low startup time. If your code is complex and uses many tools, pick one that plays well with those tools.

The authors also note that these technologies are evolving rapidly, like new models of ovens coming out every year. What works best today might change tomorrow, so developers need to keep watching the landscape. In the future, new standards might make these choices easier, but for now, careful testing is the only way to find the right fit.

Evaluating Application Characteristics for GPU Portability Layer Selection

1. The "Startup Time" Problem

2. The "Crowded Kitchen" Problem

3. The "Toolbox Compatibility" Problem

4. The "Furniture Arrangement" Problem

5. The "Specialized Tools" Problem

6. The "Construction Time" and "Moving Day" Problems

The Final Verdict

Technical Summary: Evaluating Application Characteristics for GPU Portability Layer Selection

Problem Statement

Methodology

Key Findings and Results

1. Kernel Launch Latency

2. Concurrency and Thread Pools

3. External Library and Compiler Compatibility

4. Data Structures and Memory Transfers

5. RNGs, FFTs, and Atomics

6. Compilation Time

7. Runtime Provisioning

Significance and Conclusion
