Electron-phonon physics at the exascale: A hybrid MPI-GPU-OpenMP framework for scalable Wannier interpolation

This paper presents a portable hybrid MPI-GPU-OpenMP framework for the EPW code that enables scalable electron-phonon calculations on exascale supercomputers, achieving up to a 29-fold speedup and tackling previously intractable large-scale problems such as 20 nm stanene nanoribbons.

Tae Yun Kim, Zhe Liu, Sabyasachi Tiwari, Elena R. Margine, Feliciano Giustino

Published Thu, 12 Ma

Imagine you are trying to predict how a material conducts electricity or heat. To do this accurately, you need to understand a complex dance between electrons (the tiny particles carrying charge) and phonons (vibrations in the material's atomic lattice, like sound waves).

For decades, scientists have used a software tool called EPW to calculate this dance. But there was a major problem: the calculations were like trying to count every single grain of sand on a beach by hand. Even with the world's fastest supercomputers, the task was so slow that it became impossible to study large, complex materials.

This paper introduces a massive upgrade to EPW, turning it from a "hand-counting" tool into a "satellite-speed" machine. Here is how they did it, explained simply:

1. The Problem: The "Grain of Sand" Bottleneck

The old version of EPW (v5.9) was like a single, very hardworking librarian trying to organize a library. It used a method called MPI parallelization, which is like hiring more librarians. However, these librarians kept stopping to talk to each other, check the same books, and manage paperwork (input/output). As they added more librarians, the time spent talking and managing paperwork grew so large that adding more people actually made the job slower. They hit a "speed limit" where they couldn't scale up.
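The "more librarians makes it slower" effect can be sketched with a toy cost model (illustrative only, not taken from the paper): each worker's share of the computation shrinks as workers are added, but coordination and I/O overhead grows with the team size, so total run time bottoms out and then climbs again.

```python
# Toy scaling model (illustrative, not from the paper): compute time shrinks
# as 1/N, but per-worker coordination overhead grows with N, so the total
# run time has a sweet spot beyond which adding workers hurts.
def run_time(n_workers, total_work=1000.0, overhead_per_worker=0.5):
    compute = total_work / n_workers                # shrinks with more workers
    coordination = overhead_per_worker * n_workers  # grows with team size
    return compute + coordination

# Past the sweet spot, adding workers makes the job slower:
sweet_spot = min(range(1, 257), key=run_time)
```

With these made-up constants the optimum sits near sqrt(total_work / overhead_per_worker) ≈ 45 workers; beyond it, the coordination term dominates, which is exactly the "speed limit" described above.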

2. The Solution: A Three-Layer Team (MPI-GPU-OpenMP)

The new version (v6.1) introduces a hybrid strategy that combines three different types of workers to tackle the problem:

  • The Managers (MPI Images): Instead of one big group, they split the work into separate "images" (like different branches of a library). Each branch handles a chunk of the work independently without needing to talk to the others constantly. This eliminates the "talking overhead."
  • The Specialists (GPUs): The most tedious part of the job involves doing the same math over and over again. The old system used standard computer processors (CPUs) for this. The new system offloads this heavy lifting to GPUs (Graphics Processing Units).
    • Analogy: If the CPU is a single chef chopping vegetables slowly but carefully, the GPU is a massive industrial food processor that can chop thousands of vegetables in a second. The new code realizes that the "food processor" is perfect for this specific task.
  • The Assistants (OpenMP Threads): Inside each branch, instead of having one worker do everything, they use OpenMP to split tasks among multiple cores on the same machine. It's like having one chef with four arms instead of just two.
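A minimal sketch of the layered idea (in Python for readability; EPW itself is Fortran, and all names here are invented for illustration): the fine grid is dealt out into independent chunks, one per "image", and each image works through its own chunk in parallel without talking to its neighbors.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_images(k_points, n_images):
    """Deal the fine-grid k-points into near-equal, independent chunks."""
    return [k_points[i::n_images] for i in range(n_images)]

def image_work(chunk):
    # Stand-in for Wannier interpolation on one image's k-points; in EPW
    # this heavy inner loop is what gets offloaded to the GPU, with OpenMP
    # threads covering the remaining CPU-side work.
    return {k: k * k for k in chunk}

def run(k_points, n_images=4):
    images = split_into_images(k_points, n_images)
    # Threads stand in for MPI images here; real images are separate
    # processes (or whole nodes) that never exchange data during this loop.
    with ThreadPoolExecutor(max_workers=n_images) as pool:
        partials = pool.map(image_work, images)
    merged = {}
    for part in partials:
        merged.update(part)
    return merged
```

The point of the sketch is the absence of communication inside the loop: each image only touches its own chunk, so there is no "talking overhead" to grow as more images are added.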

3. The "Smart" Strategy: Reusing Data

A key innovation in this paper is how they handle memory.

  • The Old Way: Every time a worker needed a piece of data, they would run to the storage room, grab it, use it, and put it back. This running back and forth wasted time.
  • The New Way: The team realized that the "ingredients" (the coarse-grid data) are the same for every step of the calculation. So, they load the ingredients onto the GPU once at the start and leave them there. The GPU then cooks the entire meal without ever leaving the kitchen. This saves a massive amount of time.
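The difference between the two ways can be mimicked with a toy "device" that counts host-to-device transfers (a sketch of the idea only; the real code keeps the coarse-grid arrays resident in GPU memory, and these class and function names are invented):

```python
class Device:
    """Fake accelerator that tallies how often data is uploaded to it."""
    def __init__(self):
        self.memory = {}
        self.transfers = 0

    def upload(self, name, data):
        self.memory[name] = list(data)  # simulate a host-to-device copy
        self.transfers += 1

def interpolate_step(device, q):
    # Stand-in for one fine-grid interpolation using the resident coarse data.
    return sum(device.memory["coarse"]) * q

def run_old(coarse, q_points):
    dev = Device()
    results = []
    for q in q_points:
        dev.upload("coarse", coarse)  # old way: re-fetch the data every step
        results.append(interpolate_step(dev, q))
    return results, dev.transfers

def run_new(coarse, q_points):
    dev = Device()
    dev.upload("coarse", coarse)      # new way: upload once, reuse throughout
    results = [interpolate_step(dev, q) for q in q_points]
    return results, dev.transfers
```

Both versions produce identical results, but `run_old` performs one transfer per step while `run_new` performs exactly one in total, which is where the time savings come from.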

4. The Results: From Hours to Minutes

The team tested this new system on three of the world's most powerful supercomputers (Vista, Perlmutter, and Aurora).

  • Speed: They achieved up to a 29-fold speedup. Roughly speaking, a calculation that once took 29 hours now finishes in about an hour.
  • Scalability: The code keeps performing well even across thousands of GPUs at once. It scales almost linearly: double the hardware and the runtime nearly halves.
  • The "Impossible" Test: The ultimate test was a material called stanene (a one-atom-thick sheet of tin being explored for future electronics), cut into narrow nanoribbons. The largest version had 98 atoms in a tiny strip. The old software couldn't even start this calculation because it required too much memory. The new software solved it in minutes, revealing new physics about how electricity flows through these tiny, topological wires.
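What "scales almost perfectly" means can be made concrete with the standard strong-scaling bookkeeping. The node count below is made up for illustration; only the 29-fold figure comes from the summary above.

```python
def speedup(t_baseline, t_parallel):
    """How many times faster the parallel run is than the baseline."""
    return t_baseline / t_parallel

def efficiency(t_baseline, t_parallel, resource_ratio):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup(t_baseline, t_parallel) / resource_ratio

# Hypothetical numbers: 29 h baseline vs 1 h using 32x the resources.
s = speedup(29.0, 1.0)         # 29x faster
e = efficiency(29.0, 1.0, 32)  # 29/32, about 91% of the ideal speedup
```

An efficiency near 1.0 is what "double the hardware, halve the time" looks like in practice; values well below 1.0 signal that communication or I/O overhead is eating the gains.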

Why Does This Matter?

This isn't just about making code faster; it's about unlocking the future.

  • New Materials: Scientists can now design better batteries, solar cells, and quantum computers by simulating materials that were previously too complex to study.
  • AI and Big Data: Because the calculations are so fast, researchers can now generate massive datasets to train Artificial Intelligence models to discover new materials automatically.
  • Exascale Ready: This work proves that the software is ready for the next generation of "Exascale" supercomputers (machines capable of quintillions of calculations per second).

In a nutshell: The authors took a slow, clunky process, organized the workers better, gave them super-fast tools (GPUs), and taught them to stop wasting time running back and forth. The result is a tool that can now solve physics problems that were previously considered impossible.