Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

This paper presents highly accurate analytical performance models for modern NVIDIA Blackwell and AMD CDNA3 GPU architectures, grounded in systematic microbenchmark characterization that significantly outperforms naive roofline baselines while demonstrating portability to previous generations.

Original authors: Aaron Jarmusch, Sunita Chandrasekaran

Published 2026-05-07

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to predict how long it will take a super-fast delivery truck to drop off a package.

The Old Way (The "Naive Roofline"):
For years, engineers used a simple rule of thumb: "If the truck can drive 100 mph and the package weighs 10 pounds, it will take X minutes." They looked at the truck's top speed (the "theoretical peak") and the road conditions (memory bandwidth) and did a quick math problem.
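Written as code, the rule of thumb looks roughly like this. This is a minimal sketch of a naive roofline estimate, not the paper's code: the function name, its parameters, and every number in the example are illustrative assumptions, not real GPU specifications.

```python
def naive_roofline_time(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Naive roofline estimate of kernel time in seconds.

    The kernel is assumed to be limited by whichever is slower: doing the
    math at the brochure compute rate, or moving the data at the brochure
    memory bandwidth.
    """
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / peak_bytes_per_s
    return max(compute_time, memory_time)


# Hypothetical kernel and hypothetical "brochure" peaks (assumed values).
example_time = naive_roofline_time(
    flops=2e12,             # 2 TFLOP of work
    bytes_moved=4e9,        # 4 GB of memory traffic
    peak_flops_per_s=1e15,  # 1 PFLOP/s theoretical peak
    peak_bytes_per_s=4e12,  # 4 TB/s theoretical bandwidth
)
print(f"predicted time: {example_time * 1e3:.3f} ms")
```

Notice that nothing in this estimate knows about loading docks, elevators, or secret tunnels. It only sees the two brochure numbers.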

The Problem:
This old rule fails miserably on modern trucks (GPUs). Why? Because real life is messy.

  • The truck doesn't just drive; it has to stop at a loading dock, wait for a specific elevator, load the package into a special container, and then drive.
  • Sometimes the truck has to wait for a second truck to help.
  • Sometimes the road has a "secret tunnel" (a cache) that makes the trip faster than the main highway, but the old rule doesn't know about the tunnel.
  • The "top speed" listed on the truck's brochure is often a fantasy number that the truck can never actually sustain in real traffic.

The paper says that using this old rule leads to 95% to 99% errors. It's like predicting 10 minutes for a trip that actually takes hours.
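To get a feel for what an error of that size means, here is a small worked example. It assumes error is measured relative to the measured time, which is a common convention but an assumption about the paper's exact definition; the function name and the timings are made up for illustration.

```python
def relative_error_pct(predicted, measured):
    """Absolute prediction error as a percentage of the measured time."""
    return abs(predicted - measured) / measured * 100.0


# Hypothetical kernel that really takes 600 seconds (a "10-minute trip").
measured_s = 600.0
for predicted_s in (30.0, 6.0):
    error = relative_error_pct(predicted_s, measured_s)
    print(f"predicted {predicted_s:5.1f} s, measured {measured_s:.1f} s, "
          f"error {error:.1f}%")
```

Under this definition, a prediction that is too optimistic can never be off by more than 100%, so errors of 95% to 99% mean the predicted time is only 1% to 5% of what was actually measured.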

The New Solution (The "Microbenchmark-Driven Model"):
The authors (Aaron Jarmusch and Sunita Chandrasekaran) built a new, super-accurate prediction system for the two most advanced "trucks" on the market today:

  1. NVIDIA Blackwell (B200): The latest high-tech truck.
  2. AMD CDNA3 (MI300A): The latest competitor truck.

Instead of guessing based on brochures, they went out and measured exactly how these trucks behave in real life. They ran tiny, specific tests (microbenchmarks) to time every single step of the delivery process.

How They Did It (The Analogy):

  • For the NVIDIA Truck (Blackwell):
    They realized this truck has a very specific, assembly-line style. It has a special "loading dock" (called TMEM) and a "bulk loader" (called TMA) that moves things automatically.

    • The Model: They built a step-by-step stopwatch. "Step 1: Load data (takes 420 nanoseconds). Step 2: Move to the special dock. Step 3: Process the math. Step 4: Sync with the other truck."
    • Result: They predicted the time with 1.3% error. That's like predicting a 10-minute trip and being off by only 8 seconds.
  • For the AMD Truck (MI300A):
    This truck is different. It has a massive "warehouse" right next to the driver (called Infinity Cache) and the driver has to manage their own seat space (registers).

    • The Model: They created a formula that asks: "Is the package small enough to fit in the warehouse? If yes, it's super fast. If no, it has to go to the slow highway." They also checked how crowded the driver's seat is (occupancy).
    • Result: They predicted the time with 0.09% error. That is incredibly precise, almost perfect.
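The two modeling styles described above can be sketched side by side. This is a loose illustration under stated assumptions, not the authors' actual formulas: only the 420 ns load time comes from the text, while every other latency, bandwidth, and cache size is a hypothetical placeholder. Adding the stages up serially and scaling bandwidth linearly by occupancy are also simplifying assumptions made here.

```python
def staged_pipeline_time_ns(stage_latencies_ns):
    """Stopwatch-style model: add up the measured latency of each stage.

    Assumes the stages run strictly one after another (no overlap).
    """
    return sum(stage_latencies_ns.values())


def cache_aware_time_s(bytes_moved, working_set_bytes, cache_bytes,
                       cache_bw, dram_bw, occupancy):
    """Warehouse-style model: choose the bandwidth by whether the working
    set fits in the cache, then scale it by occupancy in (0, 1].

    Linear occupancy scaling is a placeholder assumption.
    """
    if not 0.0 < occupancy <= 1.0:
        raise ValueError("occupancy must be in (0, 1]")
    bandwidth = cache_bw if working_set_bytes <= cache_bytes else dram_bw
    return bytes_moved / (bandwidth * occupancy)


# Stage latencies: 420 ns is from the text; the rest are hypothetical.
stages_ns = {"load": 420.0, "move_to_dock": 100.0, "compute": 300.0, "sync": 50.0}
print("staged total:", staged_pipeline_time_ns(stages_ns), "ns")

# Hypothetical cache size and bandwidths (not real MI300A figures).
common = dict(bytes_moved=1e9, cache_bytes=2.56e8,
              cache_bw=1e13, dram_bw=5e12, occupancy=1.0)
fits = cache_aware_time_s(working_set_bytes=1e8, **common)
spills = cache_aware_time_s(working_set_bytes=1e9, **common)
print("fits in cache:", fits, "s; spills to memory:", spills, "s")
```

The point of both sketches is the same: every number plugged in is something you can measure with a small test, rather than something read off a spec sheet.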

Why This Matters:
The authors tested their new models on real-world jobs (like complex math problems used in science and AI).

  • The old "Roofline" method was wrong almost every time (off by nearly 100%).
  • Their new method was right almost every time.

The "Plug-and-Play" Feature:
The coolest part is that they didn't have to invent a whole new system for older trucks (like the NVIDIA H200 or AMD MI250X). They just took their existing model, swapped out the "speed limit" and "warehouse size" numbers, and it worked again. It's like having a GPS app that works for a Ford, a Toyota, and a Tesla just by changing the car model in the settings, without needing to rewrite the map.
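In code, "changing the car model in the settings" could look something like the sketch below. The class, the function, and both parameter sets are hypothetical and the numbers are placeholders, not measurements of any real GPU. The idea being illustrated is only that the model function stays the same while the measured parameters change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareParams:
    """Measured parameters for one GPU (all values here are placeholders)."""
    name: str
    sustained_bw: float  # bytes/s from main memory ("speed limit")
    cache_bw: float      # bytes/s when data fits in cache
    cache_bytes: int     # cache capacity ("warehouse size")


def predict_time_s(hw, bytes_moved, working_set_bytes):
    """Same model for every GPU; only the HardwareParams differ."""
    fits_in_cache = working_set_bytes <= hw.cache_bytes
    bandwidth = hw.cache_bw if fits_in_cache else hw.sustained_bw
    return bytes_moved / bandwidth


# Two hypothetical generations with made-up parameters.
NEWER = HardwareParams("newer_gpu", sustained_bw=4e12, cache_bw=1e13,
                       cache_bytes=256 * 2**20)
OLDER = HardwareParams("older_gpu", sustained_bw=2e12, cache_bw=5e12,
                       cache_bytes=128 * 2**20)

for hw in (NEWER, OLDER):
    t = predict_time_s(hw, bytes_moved=1e9, working_set_bytes=200 * 2**20)
    print(f"{hw.name}: {t * 1e6:.1f} microseconds")
```

Keeping the hardware numbers in one small record is what makes the swap cheap: porting to another GPU means re-running the microbenchmarks and filling in a new record, not rewriting the model.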

The Catch (Limitations):
The model works great when the "delivery" is smooth and predictable (like moving a big block of data). If the delivery involves zig-zagging through a maze (irregular data) or stopping for tiny, split-second tasks, the model gets a little less accurate. Also, the model relies on someone telling it exactly how much data is being moved; if that input is wrong, the prediction will be wrong.

In Summary:
The authors built a "smart GPS" for modern supercomputers. Instead of guessing based on marketing brochures, they measured the actual behavior of the hardware. This lets engineers predict how long a task will take on these new machines far more accurately than the old methods could, at least for regular, predictable workloads. They promise to share all their tools and measurements with the public so everyone can use them.
