Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization

This paper proposes a technique for emulating double-precision (FP64) matrix multiplication on FP8 matrix multiply-accumulate units via the Ozaki-II scheme. By overcoming earlier algorithmic limitations, it significantly reduces the computation required and enables efficient high-accuracy performance on emerging GPU architectures.

Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

Published Thu, 12 Ma

Here is an explanation of the paper using simple language and creative analogies.

The Big Picture: The "High-End Chef" Problem

Imagine you are a master chef (a supercomputer) trying to bake a very delicate, high-precision cake (a scientific calculation). To get the cake right, you need Double-Precision (FP64) measurements. This is like using a scale that measures ingredients down to the microgram.

However, your kitchen has changed. The new, super-fast ovens and mixers (modern AI chips like NVIDIA's Blackwell and Rubin) are incredibly fast at chopping vegetables and mixing batter using low-precision measurements (like "a pinch" or "a handful"). These low-precision tools are 100x faster than the old high-precision scales.

The Problem:
The new ovens have a catch. They are great at chopping vegetables (INT8) and mixing batter (FP8), but they are getting worse at using the old, slow, high-precision scales. In fact, some new ovens barely have the high-precision scales at all.

If you want to bake that delicate cake on these new ovens, you can't just use the slow scales. You have to use the fast, low-precision tools to fake the high-precision result. This is called Emulation.

The Old Trick: The "Ozaki-I" Method

Scientists previously figured out a way to do this using a method called Ozaki-I.

  • The Analogy: Imagine you need to measure a huge distance, but you only have a 1-foot ruler. You measure the distance 100 times, adding up the results.
  • The Flaw: To get high precision, you have to measure so many times (121 times in this case) that you spend all your time measuring and not much time baking. It's accurate, but slow.
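To make the "measure many times" idea concrete, here is a toy Python sketch of the split-and-accumulate principle behind Ozaki-I. This is a simplified scalar-integer version, not the paper's actual matrix algorithm: each large number is split into small 8-bit slices, every pair of slices is multiplied as a cheap "low-precision" operation, and the exact product is rebuilt from the partial results.

```python
# Toy sketch of the Ozaki-I "split and sum" idea (not the paper's actual
# implementation): represent each high-precision integer as a sum of small
# 8-bit slices, multiply every pair of slices with "low-precision"
# arithmetic, and accumulate the exact product from the partial results.

def split_into_slices(x, base=256, n_slices=4):
    """Split x into n_slices digits in the given base (little-endian)."""
    slices = []
    for _ in range(n_slices):
        slices.append(x % base)
        x //= base
    return slices

def emulated_multiply(a, b, base=256, n_slices=4):
    """Multiply a*b using only products of small (sub-base) slices."""
    a_slices = split_into_slices(a, base, n_slices)
    b_slices = split_into_slices(b, base, n_slices)
    total = 0
    for i, ai in enumerate(a_slices):
        for j, bj in enumerate(b_slices):
            # Each ai*bj is a small, "low-precision" multiplication.
            total += ai * bj * base ** (i + j)
    return total

assert emulated_multiply(123456789, 987654321) == 123456789 * 987654321
```

Note that the number of slice-pair products grows quadratically with the number of slices, which is exactly the "measuring 121 times" cost the analogy describes.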

The New Trick: The "Ozaki-II" Method

There is a smarter way called Ozaki-II.

  • The Analogy: Instead of measuring the whole distance with a 1-foot ruler, you use a system based on the Chinese Remainder Theorem (CRT). Think of it like asking three different people to work out the number of grains of sand in a jar, each in their own way.
    • Person A counts by 3s.
    • Person B counts by 5s.
    • Person C counts by 7s.
    • If you know the remainders from all three, you can mathematically reconstruct the exact total number without ever counting them all individually.
  • The Benefit: This method is much faster because you do fewer "measurements" (multiplications).
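The sand-counting analogy runs directly in code. Below is a minimal, self-contained CRT reconstruction; the moduli 3, 5, and 7 are just the analogy's numbers, not the moduli the paper actually uses.

```python
# Toy sketch of the CRT "sand counting" analogy: recover a number from its
# remainders modulo 3, 5, and 7. The moduli are pairwise coprime, so CRT
# uniquely determines any value below 3 * 5 * 7 = 105.

def crt_reconstruct(remainders, moduli):
    """Reconstruct x from the values x mod m_i via the CRT."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r, m in zip(remainders, moduli):
        Mi = M // m              # product of the other moduli
        inv = pow(Mi, -1, m)     # modular inverse of Mi modulo m (Python 3.8+)
        x += r * Mi * inv
    return x % M

secret = 52                      # the "grains of sand", below 105
remainders = [secret % m for m in (3, 5, 7)]
assert crt_reconstruct(remainders, (3, 5, 7)) == 52
```

The key property is that the remainders are computed independently (the three "people" never talk to each other), yet together they pin down the exact answer.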

The Specific Challenge: Why "INT8" vs. "FP8" Matters

The original Ozaki-II method was designed for INT8 (Integer 8-bit).

  • INT8 is like counting whole apples. It's perfect for the "remainder" math because it deals with whole numbers only.
  • FP8 (Floating Point 8-bit) is like measuring apples in cups. It has decimals.

The Problem: The original Ozaki-II recipe breaks if you try to use FP8 (cups) instead of INT8 (apples). The math gets messy because FP8's scaling factors (exponents) and rounding interfere with the exact "remainder" arithmetic the trick depends on.
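A toy illustration of why rounding is fatal here (the helper below is ours, not the paper's): an FP8 format like E4M3 has only 3 explicit mantissa bits, so integers above 2^4 = 16 get rounded to a nearby representable value, and a rounded number no longer has the correct remainder.

```python
# Toy illustration (not the paper's construction): a float format with only
# 3 explicit mantissa bits (like FP8 E4M3) rounds integers above 2**4 = 16,
# so remainder arithmetic done through it silently breaks.

def round_to_mantissa(x, mantissa_bits=3):
    """Round a non-negative integer to the nearest value representable
    with 1 implicit bit + mantissa_bits explicit significand bits."""
    if x == 0:
        return 0
    exp = x.bit_length() - 1 - mantissa_bits
    if exp <= 0:
        return x                     # small integers are exact
    step = 1 << exp
    return round(x / step) * step    # round to nearest multiple of 2**exp

assert round_to_mantissa(13) == 13   # exact: fits in 4 significand bits
assert round_to_mantissa(37) == 36   # rounded: too many significand bits
# Exact arithmetic gives 37 % 7 == 2, but the rounded value disagrees:
assert round_to_mantissa(37) % 7 != 37 % 7
```

INT8, by contrast, represents every integer in its range exactly, which is why the original recipe leaned on it.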

The Authors' Solution: The "Hybrid Chef"

The authors (Uchino, Ozaki, and Imamura) invented a new way to make the Ozaki-II trick work with FP8 (the new, dominant tool on future chips).

They created a Hybrid Method:

  1. The Square Moduli Trick: For some of the "guessers" in our sand-counting analogy, they chose numbers that are perfect squares (like 33² = 1089). This allows them to use a special math shortcut (Modular Reduction) that skips the messy steps.
  2. The Karatsuba Extension: For the other numbers, they used a technique called Karatsuba (a fast multiplication trick) to break the big numbers down into smaller chunks that fit inside the FP8 "cup."
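The Karatsuba trick itself is easy to show in miniature. This one-level integer version is purely illustrative; the paper applies the idea to matrix products, and in that setting the chunks must stay small enough to fit in FP8.

```python
# Toy one-level Karatsuba (the classic fast-multiplication trick, not the
# paper's exact FP8 variant): multiply two 2-digit numbers using 3 small
# multiplications instead of the naive 4.

def karatsuba_2digit(a, b, base=256):
    """Compute a*b for a, b < base**2 with only 3 sub-products."""
    a_hi, a_lo = divmod(a, base)
    b_hi, b_lo = divmod(b, base)
    p_hi = a_hi * b_hi                      # multiplication 1
    p_lo = a_lo * b_lo                      # multiplication 2
    p_mid = (a_hi + a_lo) * (b_hi + b_lo)   # multiplication 3
    # Karatsuba identity: a_hi*b_lo + a_lo*b_hi == p_mid - p_hi - p_lo,
    # so the cross term comes for free from the three products above.
    # (Note: a_hi + a_lo can exceed base, so the middle operands need one
    # extra bit of headroom; managing that within FP8's narrow range is
    # part of what the paper's extension has to handle.)
    return p_hi * base**2 + (p_mid - p_hi - p_lo) * base + p_lo

assert karatsuba_2digit(50000, 60000) == 50000 * 60000
```

Saving one multiplication per level is what keeps the total count down when every multiplication is a full matrix product.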

The Result: They managed to combine these two tricks so that they can use the fast FP8 hardware to do the high-precision math, but they only need to do 36 "measurements" (matrix multiplications) instead of the 121 required by the old Ozaki-I method.

Why Does This Matter?

  1. Future-Proofing: New supercomputers (like NVIDIA's Rubin) are removing the "whole apple" (INT8) counters and focusing entirely on the "cup" (FP8) counters. If you don't have a way to use FP8 for high-precision math, you can't run these scientific simulations on the fastest new hardware.
  2. Speed vs. Memory:
    • Speed: The new FP8 method is slower than the old INT8 method (about 2.5x slower) because FP8 is less efficient for this specific kind of exact arithmetic.
    • Memory: The new method uses more computer memory (RAM) to store the temporary "cups" of data.
    • The Trade-off: However, on the newest chips where INT8 is barely available, the FP8 method is the only way to get the job done. It's better to be slightly slower and use more memory than to be unable to run the simulation at all.

The Bottom Line

The paper says: "We found a clever way to use the new, fast, low-precision 'cups' (FP8) to bake the delicate 'high-precision cake' (Double-Precision Math) that scientists need. While the old 'whole apple' (INT8) method is still faster if you have it, our new method ensures that even if the kitchen only has 'cups' (like on the upcoming Rubin chips), we can still bake the cake."

In short: They built a bridge so that high-precision science can run on the next generation of AI supercomputers, even when those computers stop supporting the old, high-precision tools.