GPU Acceleration and Portability of the TRIMEG Code for… — Plain-Language Explanation

The Big Picture: Cooking a Cosmic Storm

Imagine trying to predict the weather inside a star. In the real world, we can't just stick a thermometer inside the sun or a fusion reactor; it's too hot and chaotic. Instead, scientists use supercomputers to run "virtual simulations" of plasma (super-hot, electrically charged gas).

The TRIMEG code is a specific, very sophisticated recipe for simulating this plasma. It tracks tens to hundreds of millions of tiny particles (like individual grains of sand in a storm) to see how they swirl, crash, and create turbulence. The problem? This recipe is incredibly heavy. Running it on a standard computer processor (CPU) is like trying to move a mountain with a single spoon. It takes too long.

The Goal: The author, Giorgio Daneri, wanted to speed this up by using GPUs (Graphics Processing Units). Think of a CPU as a single master chef who is very smart but can only chop one vegetable at a time. A GPU is like a kitchen with 10,000 sous-chefs who can all chop vegetables simultaneously. The thesis is about figuring out how to get that single master chef's recipe to work perfectly with an army of 10,000 sous-chefs, and doing it in a way that works for two different brands of kitchens (NVIDIA and AMD).

The Challenge: The "Universal Translator" Problem

The author chose a tool called OpenMP to do the translation. Think of OpenMP as a universal translator that tells the computer, "Hey, take this part of the recipe and give it to the GPU."

However, the author ran into two major hurdles:

The "Compiler" Glitch: The software that translates the code (the compiler) wasn't perfect. It was like trying to use a universal translator that sometimes forgets how to say "salt" or "heat." The author had to rewrite parts of the code to fit the translator's quirks. For example, the code used advanced "polymorphism" (a fancy way of saying that one variable can stand in for several different kinds of object, with the exact kind decided only while the program runs). The compilers for the GPUs didn't understand this shape-shifting, so the author had to replace the shape-shifters with rigid boxes of a single, fixed kind to make them work.
The "Traffic Jam": Moving data between the main computer (CPU) and the GPU (the sous-chefs) is slow. If you keep stopping to hand ingredients back and forth, the sous-chefs sit idle. The author had to restructure the code so that all the ingredients were moved to the GPU once at the start, rather than constantly shuttling them back and forth.

The Solution: Restructuring the Kitchen

To make the code run on both NVIDIA and AMD GPUs, the author had to perform some "surgery" on the TRIMEG code:

Flattening the Map: The code used a complex map to find which mesh triangles lie near a particle. This map was like a messy filing cabinet of folders within folders. The author flattened it into a few plain, straight lists that can be copied to the GPU in one go and read without getting lost.
Fixing the "Race": Sometimes, when thousands of sous-chefs try to write on the same whiteboard at the same time, they scribble over each other (a "race condition"). The author found spots where the code was doing this and fixed it so everyone wrote in their own lane.
The "One-Size-Fits-All" Compromise: Because the two GPU brands (NVIDIA and AMD) speak slightly different languages, the author had to create a single code version that works for both, even if it meant using some "workarounds" (like using a specific type of memory allocation that works for both, even if it's not the absolute fastest for one of them).
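The whiteboard problem can be sketched in a few lines. This is only a toy model, assuming a two-step calculation that parks an intermediate value in a scratch buffer; the function names are hypothetical and are not taken from TRIMEG or its B-spline library. Two workers are interleaved by hand so the outcome is the same on every run.

```python
# Hypothetical illustration: none of these names come from TRIMEG.

def evaluate_shared(x, scratch):
    """Two-step evaluation that parks its intermediate result in a shared buffer."""
    scratch[0] = x * x          # step 1: write the intermediate value
    yield                       # another worker may run here
    yield scratch[0] + 1.0      # step 2: read it back (possibly overwritten)

def evaluate_local(x):
    """Same computation, but the intermediate value is private to this call."""
    scratch = [x * x]
    yield
    yield scratch[0] + 1.0

def run_interleaved(workers):
    """Advance every worker through step 1, then collect every worker's step 2."""
    for w in workers:
        next(w)
    return [next(w) for w in workers]

shared = [0.0]
racy = run_interleaved([evaluate_shared(x, shared) for x in (2.0, 3.0)])
safe = run_interleaved([evaluate_local(x) for x in (2.0, 3.0)])
print(racy)  # [10.0, 10.0]  the first worker read the second worker's value
print(safe)  # [5.0, 10.0]   each worker kept its own lane
```

With the shared buffer, the first worker's answer is silently wrong. Giving each call its own scratch space removes the conflict without any locking.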

The Results: Did it Work?

The author tested the new GPU version against the old CPU version using two famous "test cases" (like standard driving tests for a new car):

The Cyclone Case: A simplified simulation of plasma turbulence.
The TCV-X21 Case: A more complex, realistic simulation involving the edge of the plasma.

The Verdict:

Speed: The GPU version was significantly faster. In some tests, it was nearly 30 times faster than the CPU version when running on a single machine.
Accuracy: The results from the GPU matched the CPU results almost perfectly. The "weather patterns" (energy growth and turbulence structures) looked the same.
Portability: The code successfully ran on both NVIDIA and AMD hardware without needing to be completely rewritten for each one.

The Catch (Limitations)

The author is honest about the limitations:

The "Translator" isn't perfect yet: The compilers (the software that turns code into machine language) for these GPUs are still maturing. Sometimes they produce slightly different math results than the CPU, which can cause tiny discrepancies that build up over a long simulation.
Hardware Mismatch: If you have a computer with a lot of CPU cores but only one GPU, the GPU might get overwhelmed if you try to feed it too many tasks at once. The author found that for the best results, you need to balance how many "chefs" (MPI processes) you have versus how many "sous-chefs" (GPU threads) are available.
No "Magic Bullet": While the particle-moving part of the code got a massive speed boost, other parts of the simulation (like solving the magnetic field equations) still run on the CPU because the tools to move those specific parts to the GPU aren't ready yet.

Summary

In short, this thesis is a story of engineering ingenuity. The author took a heavy, slow, complex simulation code and successfully taught it to run on modern, powerful graphics cards. They navigated a minefield of software bugs and compiler limitations to create a version that works on two different types of hardware, proving that we can simulate fusion plasma much faster without losing accuracy. It's a crucial step toward making fusion energy research more efficient, though the journey to a fully automated, perfect translation isn't quite over yet.

Technical Summary: GPU Acceleration and Portability of the TRIMEG Code for Gyrokinetic Plasma Simulations using OpenMP

Problem Statement
Plasma physics simulations, particularly gyrokinetic models used to study instabilities and turbulence in tokamak fusion devices, are computationally intensive. The TRIMEG code, a high-accuracy particle-in-cell (PIC) solver utilizing a C1 finite element method on unstructured triangular meshes, faces significant execution time challenges due to the massive number of particles (often $10^7$ to $10^8$) required for realistic simulations. While the code already employs MPI for multi-node parallelism, the particle pushing and grid-to-particle (G2P) operations constitute the primary bottleneck, consuming up to 80% of the total execution time. The challenge lies in accelerating these specific "hotspots" using Graphics Processing Units (GPUs) while maintaining portability across different hardware architectures (specifically NVIDIA and AMD) and preserving the code's complex object-oriented structure, which includes polymorphism and derived types.

Methodology
The study focuses on porting the TRIMEG code to GPU architectures using the OpenMP offloading API (version 4.0 and later). The methodology involved:

Target Selection: The particle pusher kernel and associated G2P operations (pullback, density calculation, and distribution function interpolation) were identified as the primary targets for offloading due to their high arithmetic intensity and lack of inter-particle dependencies.
Code Restructuring for Portability: Significant restructuring was required to overcome compiler limitations in both amdflang (AMD) and nvfortran (NVIDIA). Key challenges included:
- Polymorphism: Both compilers struggled with class() derived types and type-bound procedures within GPU target regions. The solution involved refactoring the code to use non-polymorphic type() declarations where possible and implementing a workaround for circular dependencies between particle and field classes using base/extended class hierarchies and Fortran INCLUDE directives to duplicate function bodies.
- Dynamic Arrays: The code utilized a custom library mimicking C++ vectors for dynamic arrays. Since GPU kernels cannot handle dynamic allocation or complex pointer indirections easily, the mapping structure between bounding boxes and mesh triangles was "flattened" from an array of structures into 1D arrays to facilitate efficient memory transfers.
- Memory Management: Preemptive memory allocation was implemented during the initialization phase to minimize runtime latency. For AMD platforms, Unified Shared Memory (USM) was leveraged where possible, while explicit enter data, update, and exit data directives were used for NVIDIA platforms lacking USM support.
- Numerical Consistency: To ensure the GPU results matched the CPU reference, compiler optimizations that altered floating-point arithmetic (such as Fused-Multiply-Add instructions) were disabled (-ffp-contract=off for AMD, -Mnofma for NVIDIA). Race conditions in the external B-spline library were resolved by switching from shared object members to locally declared automatic arrays.
Performance Evaluation: The implementation was tested on the Viper (AMD MI300A), Raven (NVIDIA A100), and Pitagora (NVIDIA H100) clusters. Performance was evaluated through:
- Kernel Profiling: Using rocprof-compute and nsys to analyze resource occupancy, memory bandwidth, and instruction mixes.
- Scalability Studies: Strong scaling tests were conducted to assess the efficiency of hybrid MPI-OpenMP offloading, specifically examining the impact of oversubscribing GPUs with multiple MPI processes.
- Grid Size Exploration: Tuning the number of OpenMP teams and threads per team to maximize hardware utilization.
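The flattening of the bounding-box-to-triangle mapping described under Dynamic Arrays can be illustrated with a minimal sketch. The offsets-plus-indices layout below is an assumption chosen for illustration; the summary only states that the mapping was converted into 1D arrays, and the function names and sample data are hypothetical.

```python
# Assumed layout (offsets + indices); not necessarily the layout used in TRIMEG.

def flatten_box_map(boxes):
    """Turn a list of per-box triangle lists into two flat arrays.

    offsets[b] .. offsets[b + 1] delimits the triangles of box b inside `triangles`.
    """
    offsets = [0]
    triangles = []
    for tri_ids in boxes:
        triangles.extend(tri_ids)
        offsets.append(len(triangles))
    return offsets, triangles

def triangles_in_box(offsets, triangles, b):
    """Look up the triangles overlapping box b using only flat arrays."""
    return triangles[offsets[b]:offsets[b + 1]]

# Hypothetical mesh: three bounding boxes overlapping 2, 0, and 3 triangles.
nested = [[4, 7], [], [1, 4, 9]]
offsets, triangles = flatten_box_map(nested)
print(offsets)    # [0, 2, 2, 5]
print(triangles)  # [4, 7, 1, 4, 9]
print(triangles_in_box(offsets, triangles, 2))  # [1, 4, 9]
```

Two contiguous arrays of known size can be mapped to a device with a single transfer each, whereas a nested structure would require following one pointer per box.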

Key Contributions

First Cross-Vendor Port: This work presents a pioneering effort in porting a complex, object-oriented Fortran gyrokinetic code to both NVIDIA and AMD GPUs using a single codebase via OpenMP offloading.
Compiler Workarounds: The thesis documents specific, non-trivial workarounds for compiler limitations regarding polymorphism, dynamic arrays, and procedure pointers in OpenMP target regions. It highlights the lack of comprehensive documentation for nvfortran and amdflang regarding these features.
Hybrid Parallelization Analysis: The study provides a detailed analysis of the trade-offs in hybrid MPI-OpenMP offloading, demonstrating that while GPU acceleration is effective for the particle pusher, the lack of OpenMP multithreading in the original code necessitates oversubscription, which can limit scalability on nodes with high core counts but limited GPU resources.
Numerical Verification: The implementation includes a rigorous verification process comparing energy growth rates and 2D mode structures against CPU results, confirming that the GPU version reproduces physics with high fidelity despite minor numerical deviations caused by compiler-specific floating-point handling.
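One source of such compiler-specific floating-point deviations is the fused multiply-add, which rounds a*b + c once instead of twice. The sketch below emulates the fused result with exact rational arithmetic; the input values are chosen purely for illustration and do not come from the thesis.

```python
from fractions import Fraction

def mul_add_separate(a, b, c):
    """a*b is rounded to double precision, then the sum is rounded again."""
    return a * b + c

def mul_add_fused(a, b, c):
    """Emulate a fused multiply-add: compute a*b + c exactly, round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Illustrative inputs: the exact product is 1 - 2**-60, which rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(mul_add_separate(a, b, c))  # 0.0
print(mul_add_fused(a, b, c))     # equals -2**-60, roughly -8.7e-19
```

Both answers are legitimate under IEEE 754 rules, so a compiler that contracts the expression on one platform and not on another produces results that differ in the last bits. Disabling contraction makes the two builds follow the same rounding sequence.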

Results

Speedup: For a realistic workload of $32 \times 10^6$ electrons, the GPU implementation achieved a speedup of approximately 14.8x on the AMD Viper node and 29.6x on the NVIDIA Pitagora node compared to the GCC-compiled CPU version on the TOK cluster.
Kernel Efficiency: The particle pusher kernel accounted for the majority of the runtime. Profiling on the AMD MI300A showed high arithmetic intensity with 80%+ L1/L2 cache hit rates, though only 18% of memory accesses were coalesced.
Scalability Limitations: Strong scaling tests revealed that while the GPU-accelerated portion scales well, the overall application speedup is constrained by the non-accelerated portions (e.g., field solvers using PETSc) and the overhead of oversubscribing GPUs. On the NVIDIA Pitagora cluster, multi-GPU support via OpenMP was found to be non-functional in the tested compiler version (nvfortran 24.9), limiting the ability to utilize all available GPUs on a node simultaneously.
Correctness: Simulations of the Cyclone case (ITG mode) and the TCV-X21 case (nonlinear ITG instability) confirmed that the GPU version correctly reproduces the energy growth rates and mode structures observed in the CPU version, with differences attributed to random number generator initialization and compiler-specific floating-point variations rather than algorithmic errors.
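The constraint imposed by non-accelerated portions follows the familiar Amdahl's law pattern. The sketch below is a generic illustration with hypothetical fractions and a hypothetical kernel speedup of 30; it does not attempt to reproduce the figures reported above, which compare runs on different clusters.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / kernel_speedup)

# Hypothetical fractions of runtime spent in the offloaded kernels.
for p in (0.80, 0.95, 0.99):
    print(p, round(amdahl_speedup(p, 30.0), 2))
# 0.8 4.41
# 0.95 12.24
# 0.99 23.26
```

Even with a very fast kernel, the overall gain is capped at 1 / (1 - p), which is why the share of runtime left on the CPU matters as much as the kernel speedup itself.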

Significance and Claims
The thesis claims that while OpenMP offloading offers a promising path for portability between different HPC architectures, it is not a "seamless" solution for complex legacy codes. The work demonstrates that achieving a working, high-performance GPU version requires extensive compiler exploration and significant code restructuring to bypass current limitations in compiler support for advanced Fortran features.

The author emphasizes that the success of this portability depends heavily on the specific compiler version rather than just the programming paradigm. The thesis concludes that the TRIMEG GPU implementation is a functional and accurate tool for gyrokinetic simulations, capable of delivering substantial speedups for the most computationally expensive parts of the code. However, the author notes that the full potential of the hardware (particularly multi-GPU nodes) is currently hindered by immature compiler support for multi-device offloading and the lack of OpenMP multithreading in the underlying CPU code structure. The work serves as a practical guide and a "surrogate documentation" for others attempting similar ports of complex Fortran codes to heterogeneous architectures.
