Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization

This paper proposes a technique for emulating double-precision (FP64) matrix multiplication on FP8 matrix multiply-accumulate units via the Ozaki-II scheme. By overcoming earlier algorithmic limitations, it significantly reduces the computation required and enables efficient high-accuracy performance on emerging GPU architectures.

Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

Published Thu, 12 Ma

Here is an explanation of the paper using simple language and creative analogies.

The Big Picture: The "High-End Chef" Problem

Imagine you are a master chef (a supercomputer) trying to bake a very delicate, high-precision cake (a scientific calculation). To get the cake right, you need Double-Precision (FP64) measurements. This is like using a scale that measures ingredients down to the microgram.

However, your kitchen has changed. The new, super-fast ovens and mixers (modern AI chips like NVIDIA's Blackwell and Rubin) are incredibly fast at chopping vegetables and mixing batter using low-precision measurements (like "a pinch" or "a handful"). These low-precision tools are 100x faster than the old high-precision scales.

The Problem:
The new ovens have a catch. They are great at chopping vegetables (INT8) and mixing batter (FP8), but they are getting worse at using the old, slow, high-precision scales. In fact, some new ovens barely have the high-precision scales at all.

If you want to bake that delicate cake on these new ovens, you can't just use the slow scales. You have to use the fast, low-precision tools to fake the high-precision result. This is called Emulation.

The Old Trick: The "Ozaki-I" Method

Scientists previously figured out a way to do this using a method called Ozaki-I.

  • The Analogy: Imagine you need to measure a huge distance, but you only have a 1-foot ruler. You measure the distance 100 times, adding up the results.
  • The Flaw: To get high precision, you have to measure so many times (121 times in this case) that you spend all your time measuring and not much time baking. It's accurate, but slow.
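To make the "measure many times" idea concrete, here is a toy Python sketch of the split-and-accumulate principle behind Ozaki-I. This is a simplified scalar-integer version, not the paper's actual matrix algorithm: each large number is split into small 8-bit slices, every pair of slices is multiplied as a cheap "low-precision" operation, and the exact product is rebuilt from the partial results.

```python
# Toy sketch of the Ozaki-I "split and sum" idea (not the paper's actual
# implementation): represent each high-precision integer as a sum of small
# 8-bit slices, multiply every pair of slices with "low-precision"
# arithmetic, and accumulate the exact product from the partial results.

def split_into_slices(x, base=256, n_slices=4):
    """Split x into n_slices digits in the given base (little-endian)."""
    slices = []
    for _ in range(n_slices):
        slices.append(x % base)
        x //= base
    return slices

def emulated_multiply(a, b, base=256, n_slices=4):
    """Multiply a*b using only products of small (sub-base) slices."""
    a_slices = split_into_slices(a, base, n_slices)
    b_slices = split_into_slices(b, base, n_slices)
    total = 0
    for i, ai in enumerate(a_slices):
        for j, bj in enumerate(b_slices):
            # Each ai*bj is a small, "low-precision" multiplication.
            total += ai * bj * base ** (i + j)
    return total

assert emulated_multiply(123456789, 987654321) == 123456789 * 987654321
```

Note that the number of slice-pair products grows quadratically with the number of slices, which is exactly the "measuring 121 times" cost the analogy describes.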

The New Trick: The "Ozaki-II" Method

There is a smarter way called Ozaki-II.

  • The Analogy: Instead of measuring the whole distance with a 1-foot ruler, you use a system based on the Chinese Remainder Theorem (CRT). Think of it like asking three different people to work out the number of grains of sand in a jar, each in their own way.
    • Person A counts by 3s.
    • Person B counts by 5s.
    • Person C counts by 7s.
    • If you know the remainders from all three, you can mathematically reconstruct the exact total number without ever counting them all individually.
  • The Benefit: This method is much faster because you do fewer "measurements" (multiplications).
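The sand-counting analogy runs directly in code. Below is a minimal, self-contained CRT reconstruction; the moduli 3, 5, and 7 are just the analogy's numbers, not the moduli the paper actually uses.

```python
# Toy sketch of the CRT "sand counting" analogy: recover a number from its
# remainders modulo 3, 5, and 7. The moduli are pairwise coprime, so CRT
# uniquely determines any value below 3 * 5 * 7 = 105.

def crt_reconstruct(remainders, moduli):
    """Reconstruct x from the values x mod m_i via the CRT."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r, m in zip(remainders, moduli):
        Mi = M // m              # product of the other moduli
        inv = pow(Mi, -1, m)     # modular inverse of Mi modulo m (Python 3.8+)
        x += r * Mi * inv
    return x % M

secret = 52                      # the "grains of sand", below 105
remainders = [secret % m for m in (3, 5, 7)]
assert crt_reconstruct(remainders, (3, 5, 7)) == 52
```

The key property is that the remainders are computed independently (the three "people" never talk to each other), yet together they pin down the exact answer.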

The Specific Challenge: Why "INT8" vs. "FP8" Matters

The original Ozaki-II method was designed for INT8 (Integer 8-bit).

  • INT8 is like counting whole apples. It's perfect for the "remainder" math because it deals with whole numbers only.
  • FP8 (Floating Point 8-bit) is like measuring apples in cups. It has decimals.

The Problem: The original Ozaki-II recipe breaks if you try to use FP8 (cups) instead of INT8 (apples). The math gets messy because FP8's scaling factors (exponents) and rounding interfere with the exact "remainder" arithmetic the trick depends on.
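A toy illustration of why rounding is fatal here (the helper below is ours, not the paper's): an FP8 format like E4M3 has only 3 explicit mantissa bits, so integers above 2^4 = 16 get rounded to a nearby representable value, and a rounded number no longer has the correct remainder.

```python
# Toy illustration (not the paper's construction): a float format with only
# 3 explicit mantissa bits (like FP8 E4M3) rounds integers above 2**4 = 16,
# so remainder arithmetic done through it silently breaks.

def round_to_mantissa(x, mantissa_bits=3):
    """Round a non-negative integer to the nearest value representable
    with 1 implicit bit + mantissa_bits explicit significand bits."""
    if x == 0:
        return 0
    exp = x.bit_length() - 1 - mantissa_bits
    if exp <= 0:
        return x                     # small integers are exact
    step = 1 << exp
    return round(x / step) * step    # round to nearest multiple of 2**exp

assert round_to_mantissa(13) == 13   # exact: fits in 4 significand bits
assert round_to_mantissa(37) == 36   # rounded: too many significand bits
# Exact arithmetic gives 37 % 7 == 2, but the rounded value disagrees:
assert round_to_mantissa(37) % 7 != 37 % 7
```

INT8, by contrast, represents every integer in its range exactly, which is why the original recipe leaned on it.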

The Authors' Solution: The "Hybrid Chef"

The authors (Uchino, Ozaki, and Imamura) invented a new way to make the Ozaki-II trick work with FP8 (the new, dominant tool on future chips).

They created a Hybrid Method:

  1. The Square Moduli Trick: For some of the "guessers" in our sand-counting analogy, they chose numbers that are perfect squares (like 33² = 1089). This allows them to use a special math shortcut (Modular Reduction) that skips the messy steps.
  2. The Karatsuba Extension: For the other numbers, they used a technique called Karatsuba (a fast multiplication trick) to break the big numbers down into smaller chunks that fit inside the FP8 "cup."
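The Karatsuba trick itself is easy to show in miniature. This one-level integer version is purely illustrative; the paper applies the idea to matrix products, and in that setting the chunks must stay small enough to fit in FP8.

```python
# Toy one-level Karatsuba (the classic fast-multiplication trick, not the
# paper's exact FP8 variant): multiply two 2-digit numbers using 3 small
# multiplications instead of the naive 4.

def karatsuba_2digit(a, b, base=256):
    """Compute a*b for a, b < base**2 with only 3 sub-products."""
    a_hi, a_lo = divmod(a, base)
    b_hi, b_lo = divmod(b, base)
    p_hi = a_hi * b_hi                      # multiplication 1
    p_lo = a_lo * b_lo                      # multiplication 2
    p_mid = (a_hi + a_lo) * (b_hi + b_lo)   # multiplication 3
    # Karatsuba identity: a_hi*b_lo + a_lo*b_hi == p_mid - p_hi - p_lo,
    # so the cross term comes for free from the three products above.
    # (Note: a_hi + a_lo can exceed base, so the middle operands need one
    # extra bit of headroom; managing that within FP8's narrow range is
    # part of what the paper's extension has to handle.)
    return p_hi * base**2 + (p_mid - p_hi - p_lo) * base + p_lo

assert karatsuba_2digit(50000, 60000) == 50000 * 60000
```

Saving one multiplication per level is what keeps the total count down when every multiplication is a full matrix product.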

The Result: They managed to combine these two tricks so that they can use the fast FP8 hardware to do the high-precision math, but they only need to do 36 "measurements" (matrix multiplications) instead of the 121 required by the old Ozaki-I method.

Why Does This Matter?

  1. Future-Proofing: New supercomputers (like NVIDIA's Rubin) are removing the "whole apple" (INT8) counters and focusing entirely on the "cup" (FP8) counters. If you don't have a way to use FP8 for high-precision math, you can't run these scientific simulations on the fastest new hardware.
  2. Speed vs. Memory:
    • Speed: The new FP8 method is slower than the old INT8 method (about 2.5x slower) because FP8 is less efficient for this specific kind of exact arithmetic.
    • Memory: The new method uses more computer memory (RAM) to store the temporary "cups" of data.
    • The Trade-off: However, on the newest chips where INT8 is barely available, the FP8 method is the only way to get the job done. It's better to be slightly slower and use more memory than to be unable to run the simulation at all.

The Bottom Line

The paper says: "We found a clever way to use the new, fast, low-precision 'cups' (FP8) to bake the delicate 'high-precision cake' (Double-Precision Math) that scientists need. While the old 'whole apple' (INT8) method is still faster if you have it, our new method ensures that even if the kitchen only has 'cups' (like on the upcoming Rubin chips), we can still bake the cake."

In short: They built a bridge so that high-precision science can run on the next generation of AI supercomputers, even when those computers stop supporting the old, high-precision tools.