Imagine you are trying to predict a massive tsunami hitting a coastline. To do this accurately, you need to simulate how water moves, how sound travels through the ocean, and how the sea floor shifts during an earthquake. This isn't just a simple calculation; it's a giant, complex puzzle made of millions of tiny pieces (called "finite elements") that must be solved with extreme precision.
If you get the math slightly wrong, the prediction could be useless. That's why scientists use Double Precision (FP64) math—think of it as using a ruler with microscopic markings instead of a standard tape measure. But here's the problem: doing this level of detail on a supercomputer is incredibly slow and energy-hungry. It's like trying to paint a masterpiece with a tiny brush while running a marathon.
This paper is about a team of engineers who found a way to make that marathon run twice as fast while using far less energy, without losing a single drop of precision. Here is how they did it, explained simply:
1. The Problem: The "Heavy Lifting" Bottleneck
In the past, computers had two types of workers:
- The General Workers (CUDA Cores): These are like a team of general contractors. They are great at doing many different tasks, but when it comes to heavy lifting (multiplying big grids of numbers), they have to carry the materials one by one. They get tired (slow) because they spend more time walking back and forth to get materials than actually building.
- The Specialized Workers (Tensor Cores): These are like a team of forklifts. They are designed to move huge pallets of materials at once. For years, these forklifts could only handle "rough" materials (low precision). If you needed "microscopic precision" (Double Precision), you couldn't use the forklifts; you had to use the general contractors, which was slow.
2. The Breakthrough: New Forklifts for Microscopic Precision
NVIDIA recently upgraded their "forklifts" (Tensor Cores) so they can now handle Double Precision materials. However, just having the new forklifts isn't enough. The way the construction site was organized (the software code) was still designed for the old general contractors.
The team realized that the "construction site" (the math for the tsunami simulation) was full of small, repetitive tasks. They decided to reorganize the work so the new Double Precision Forklifts could do the heavy lifting.
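The "pallets" the forklifts move are small square blocks (tiles) of a matrix. Here is a minimal CPU-side sketch, with sizes invented for illustration, contrasting the element-by-element multiply the general cores perform with the tiled multiply-accumulate pattern that FP64 tensor cores are built to consume in one instruction:

```cpp
#include <cstddef>

// Illustrative sizes only; real tensor-core tiles differ by architecture.
constexpr int N = 8;     // full matrix dimension
constexpr int TILE = 4;  // tile ("pallet") size

// Element-by-element: one result value at a time, like contractors
// carrying materials individually.
void matmul_naive(const double (&A)[N][N], const double (&B)[N][N],
                  double (&C)[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double acc = 0.0;
            for (int k = 0; k < N; ++k) acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

// Tiled: process TILE x TILE blocks at once. Each innermost block is
// the "one pallet move" a tensor core performs as a single operation.
void matmul_tiled(const double (&A)[N][N], const double (&B)[N][N],
                  double (&C)[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) C[i][j] = 0.0;
    for (int bi = 0; bi < N; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE)
            for (int bk = 0; bk < N; bk += TILE)
                for (int i = bi; i < bi + TILE; ++i)
                    for (int j = bj; j < bj + TILE; ++j)
                        for (int k = bk; k < bk + TILE; ++k)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Both routines compute the same double-precision product; the tiled version simply rearranges the work into blocks, which is exactly the reorganization the team had to do to the simulation's math.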
3. The Strategy: "Fusion" and "Re-arranging"
To make the forklifts work perfectly, they did two clever things:
The "Fusion" Trick (Kitchen Analogy): Imagine you are making a sandwich.
- Old Way: You get the bread, put it on the counter. Then you get the cheese, put it on the counter. Then you get the ham. You walk back and forth to the fridge three times.
- New Way (Fusion): You open the fridge, grab the bread, cheese, and ham all at once, and make the sandwich in one smooth motion.
- In the paper: They combined several small math steps into one giant step. This meant the computer didn't have to stop and "walk" to memory as often.
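In code, kernel fusion looks like the sketch below (the three update steps and their names are invented here for illustration, not taken from the paper). The unfused version streams the whole array through memory three times; the fused version does all three steps while each value is already in hand:

```cpp
#include <vector>
#include <cstddef>

// Unfused: three separate passes over u -- three "trips to the fridge".
void update_unfused(std::vector<double>& u, const std::vector<double>& src,
                    double dt, double damp) {
    for (std::size_t i = 0; i < u.size(); ++i) u[i] *= dt;      // pass 1
    for (std::size_t i = 0; i < u.size(); ++i) u[i] += src[i];  // pass 2
    for (std::size_t i = 0; i < u.size(); ++i) u[i] *= damp;    // pass 3
}

// Fused: one pass applies all three steps, cutting memory traffic
// to a single round trip per element.
void update_fused(std::vector<double>& u, const std::vector<double>& src,
                  double dt, double damp) {
    for (std::size_t i = 0; i < u.size(); ++i)
        u[i] = (u[i] * dt + src[i]) * damp;
}
```

The results are bit-for-bit identical; only the number of trips through memory changes, which is what makes fusion a pure win on memory-bound code.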
The "Traffic Jam" Fix (Bank Conflict Analogy):
- Imagine a parking garage with 32 lanes. If 32 cars try to enter the same lane at the same time, they have to queue up and drive in one by one. This is called a "bank conflict."
- The team figured out a new parking map. They told the cars (data) exactly which lane to use so that no two cars ever tried to enter the same lane at the same time. This kept the traffic flowing smoothly.
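The classic version of this fix is to pad the data layout so that simultaneous accesses spread across all 32 banks. Below is a small self-contained model (the padding trick is a standard CUDA technique; the exact remapping the paper uses may differ) that counts how many threads pile onto the busiest bank:

```cpp
#include <array>
#include <algorithm>

constexpr int BANKS = 32;    // GPU shared memory is split into 32 banks
constexpr int THREADS = 32;  // one warp: 32 threads accessing at once

// A word address lands in bank (address mod 32).
int bank_of(int addr) { return addr % BANKS; }

// Worst case, over all banks, of how many threads hit the same bank
// when thread t reads column `col` of a row-major [32][width] array.
// 1 = conflict-free; 32 = fully serialized (the "traffic jam").
int max_conflict(int width, int col) {
    std::array<int, BANKS> hits{};
    for (int t = 0; t < THREADS; ++t)
        ++hits[bank_of(t * width + col)];
    return *std::max_element(hits.begin(), hits.end());
}
```

With a row width of 32, every thread reading column 0 lands in bank 0 and the warp serializes 32-to-1; padding each row to width 33 shifts every thread into its own bank and the jam disappears.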
4. The Results: Speed and Savings
By using these new forklifts and fixing the traffic, the results were amazing:
- Speed: The simulation ran 2 times faster. A task that used to take 10 hours now takes 5.
- Energy: Because the computer finished the work faster and didn't waste energy waiting, it used up to 83% less energy for the same job.
- Scale: They tested this on the Alps supercomputer in Switzerland, which has nearly 10,000 of these powerful chips working together. The system scaled almost perfectly: adding more chips delivered a proportional speedup instead of diminishing returns.
Why Does This Matter?
The ultimate goal of this research is Real-Time Tsunami Warning.
Currently, if an earthquake happens, it might take hours to calculate if a tsunami is coming. With these new optimizations, that calculation could happen in seconds.
This means that when the ground shakes, a "Digital Twin" of the ocean can instantly predict the wave height and tell coastal cities to evacuate before the water even arrives. This isn't just about faster math; it's about saving lives by turning a slow, theoretical calculation into a real-time emergency tool.
In a nutshell: They took a supercomputer, gave it a new type of high-precision engine, tuned the transmission so the gears shifted perfectly, and turned a slow, energy-guzzling simulation into a lightning-fast, life-saving machine.