Mixed precision solvers with half-precision floating point numbers for Lattice QCD on A64FX processor

This paper demonstrates that half-precision (FP16) mixed-precision linear solvers with novel rescaling steps for Lattice QCD on A64FX processors achieve practical stability with only a minor increase in iteration count compared to double-precision methods.

Original authors: Issaku Kanamori, Hideo Matsufuru, Tatsumi Aoyama, Kazuyuki Kanaya, Yusuke Namekawa, Hidekatsu Nemura, Keigo Nitadori

Published 2026-02-17

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: Solving a Cosmic Puzzle with a Calculator

Imagine you are trying to solve a massive, incredibly complex puzzle that describes how the smallest building blocks of the universe (quarks and gluons) stick together. This is called Lattice QCD (Quantum Chromodynamics).

To solve this puzzle, scientists use supercomputers. The problem is that these puzzles are so huge that the computers get tired and slow down. Usually, to get the answer right, the computer has to use "Double Precision" math—think of this as using a gold-plated, high-end calculator that can handle numbers with extreme detail. It's accurate, but it's slow and heavy to carry around.

Recently, computer chips (specifically the A64FX processor in Japan's "Fugaku" supercomputer) have gotten superpowers. They can now do math with "Half Precision" numbers—think of this as using a lightweight, pocket-sized calculator. It's much faster and uses less energy, but it's prone to making mistakes if the numbers get too small or too big.

The Goal: The authors wanted to use the fast, lightweight calculator (Half Precision) to solve the cosmic puzzle, but they needed a way to keep the answers accurate.


The Problem: The "Underflow" Trap

The researchers tried to use the lightweight calculator directly, but they hit a wall.

Imagine you are trying to measure the distance between two grains of sand. If you use a ruler that only measures in whole meters, you can't see the tiny gap. In computer terms, this is called underflow.

When the math gets very precise (very small numbers), the lightweight calculator (FP16) gets confused. It thinks the number is so tiny that it's actually zero. When this happens, the calculation breaks, the computer gets stuck in a loop, and the puzzle never gets solved.

In the paper, they found that if they just tried to use the fast calculator without help, the solver would "stall" or take forever because it kept losing track of the tiny details.
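The underflow trap is easy to reproduce with NumPy's `float16` type (an illustrative sketch, not the paper's code): the smallest positive FP16 value is about 6 × 10⁻⁸, so squaring a perfectly representable number like 10⁻⁴ flushes to exactly zero — and residual norms, being sums of squares, vanish along with it.

```python
import numpy as np

# FP16 (half precision) can represent ~1e-4 just fine...
x = np.float16(1e-4)
print(x > 0)           # True

# ...but its square, ~1e-8, is below the smallest positive FP16
# value (~6e-8), so the result underflows and becomes exactly 0.
y = x * x
print(y)               # 0.0

# This is what breaks an iterative solver: residual norms are sums
# of squares, and once every square flushes to zero, the computed
# residual vanishes even though the true one has not.
r = np.full(8, 1e-4, dtype=np.float16)
print(np.sum(r * r))   # 0.0 -- the "measured" residual disappears
```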


The Solution: The "Rescaling" Trick

To fix this, the authors invented a clever trick called Rescaling.

Think of it like this:
Imagine you are trying to weigh a pile of dust motes (tiny particles) using a scale that only works for heavy rocks.

  1. The Problem: If you put a dust mote on the scale, it reads "0." You can't measure it.
  2. The Trick: Before you weigh the dust, you put it in a giant, heavy box. Now, the box is heavy enough for the scale to read.
  3. The Calculation: You do the math with the heavy box.
  4. The Result: Once you have the answer, you mentally subtract the weight of the box to get the weight of the dust.

In the paper, they do this mathematically:

  • Scaling Up: Before the computer does the hard math with the tiny numbers, they multiply everything by a big number (like 128 or 4096). This pushes the numbers into a "safe zone" where the lightweight calculator can see them clearly.
  • Scaling Down: After the math is done, they divide the result by that same big number to get the correct, tiny answer.
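As a toy illustration of the scale-up/scale-down idea (assuming a scale factor of 4096, one of the values mentioned above; this is a sketch, not the authors' code), here is a vector norm computed in FP16 with and without rescaling:

```python
import numpy as np

x = np.full(8, 1e-4)    # tiny vector; true norm is sqrt(8)*1e-4 ~ 2.83e-4
s = 4096.0              # scale factor (a power of two, so scaling only
                        # shifts the exponent and introduces no rounding)

# Naive FP16: every square underflows to zero, so the norm reads "0".
x16 = x.astype(np.float16)
naive_norm = float(np.sqrt(np.sum(x16 * x16)))

# Rescaled FP16: scale up into the safe zone, compute, scale back down.
xs16 = (x * s).astype(np.float16)               # entries ~0.41, safely FP16
rescaled_norm = float(np.sqrt(np.sum(xs16 * xs16))) / s

print(naive_norm)       # 0.0
print(rescaled_norm)    # ~2.83e-4, close to the true value
```

Using a power of two for the scale factor is the standard choice here: in binary floating point, multiplying or dividing by a power of two changes only the exponent, so the rescaling itself costs no accuracy.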

They applied this trick in two places:

  1. The Outer Loop: the high-precision step that computes the leftover error (the "residual") and checks how close the current solution is.
  2. The Inner Loop: the fast half-precision solver that does the deep, detailed work of correcting the solution.
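Putting the two loops together, a minimal sketch of the overall pattern is iterative refinement with a rescaled FP16 inner solve. Everything below (the toy matrix, the Jacobi inner iteration, the names) is an illustrative assumption, not the authors' Lattice QCD solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = 2.0 * np.eye(n) + 0.05 * rng.standard_normal((n, n))  # toy, well-conditioned
b = 1e-6 * rng.standard_normal(n)     # tiny right-hand side: raw FP16 underflows

A16 = A.astype(np.float16)            # the inner loop works in half precision
diag16 = np.diag(A16)

def inner_solve_fp16(r, iters=25):
    """Inner loop: approximately solve A e = r in FP16, with rescaling."""
    s = np.max(np.abs(r))             # scale the residual into FP16's safe zone
    if s == 0.0:
        return np.zeros_like(r)
    r16 = (r / s).astype(np.float16)  # entries are now O(1)
    e = np.zeros(n, dtype=np.float16)
    for _ in range(iters):            # simple Jacobi iteration, all in FP16
        e = e + (r16 - A16 @ e) / diag16
    return s * e.astype(np.float64)   # scale the correction back down

# Outer loop: check and correct the solution in double precision.
x = np.zeros(n)
for _ in range(20):
    r = b - A @ x                     # true residual, computed in FP64
    if np.linalg.norm(r) < 1e-12 * np.linalg.norm(b):
        break
    x = x + inner_solve_fp16(r)

print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # tiny (converged)
```

Each outer pass shrinks the error by roughly the accuracy of the half-precision inner solve, so a handful of cheap FP16 solves reaches full double-precision accuracy — the same division of labor the paper exploits.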

The Results: Fast and Accurate

After applying this "Rescaling" trick, the results were amazing:

  • Stability: The solver stopped crashing. It didn't get stuck in loops anymore.
  • Speed: The lightweight calculator was twice as fast as the standard "single precision" (medium speed) calculator and three times faster than the heavy "double precision" calculator.
  • Accuracy: Even though they used the fast, simple calculator, the final answer was just as accurate as if they had used the slow, heavy one. The only cost was a tiny bit more work (about 20% more steps), which was totally worth it for the massive speed gain.

Why This Matters

This paper is like finding a way to drive a race car (the supercomputer) at top speed without blowing the engine.

  • For the Future: As computers get more powerful but also more specialized for Artificial Intelligence (which loves fast, simple math), being able to use these "lightweight" numbers for complex science is a game-changer.
  • The Takeaway: You don't always need the most expensive, heavy-duty tools to get the job done. Sometimes, if you use a clever trick (like rescaling), a simple, fast tool can do the job of a giant, slow one.

In short: The authors taught the supercomputer how to use a "pocket calculator" to solve a "universe-sized puzzle" by simply adjusting the volume so the calculator doesn't get confused by the tiny numbers. The result? A solution that is much faster and just as correct.
