Microbenchmark-Driven Analytical Performance Modeling… — Plain-Language Explanation

Imagine you are trying to predict how long it will take a super-fast delivery truck to drop off a package.

The Old Way (The "Naive Roofline"):
For years, engineers used a simple rule of thumb: "If the truck can drive 100 mph and the package weighs 10 pounds, it will take X minutes." They looked at the truck's top speed (the "theoretical peak") and the road conditions (memory bandwidth), did a quick calculation, and took whichever of the two limits was slower as the answer.

The Problem:
This old rule fails miserably on modern trucks (GPUs). Why? Because real life is messy.

The truck doesn't just drive; it has to stop at a loading dock, wait for a specific elevator, load the package into a special container, and then drive.
Sometimes the truck has to wait for a second truck to help.
Sometimes the road has a "secret tunnel" (a cache) that makes the trip faster than the main highway, but the old rule doesn't know about the tunnel.
The "top speed" listed on the truck's brochure is often a fantasy number that the truck can never actually sustain in real traffic.

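The paper says that using this old rule leads to errors of 95% to 99%. An error that large means the estimate misses almost the entire real travel time: if the rule is too optimistic, it is like promising a 10-minute trip that actually takes several hours.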

The New Solution (The "Microbenchmark-Driven Model"):
The authors (Aaron Jarmusch and Sunita Chandrasekaran) built a new, super-accurate prediction system for the two most advanced "trucks" on the market today:

NVIDIA Blackwell (B200): The latest high-tech truck.
AMD CDNA3 (MI300A): The latest competitor truck.

Instead of guessing based on brochures, they went out and measured exactly how these trucks behave in real life. They ran tiny, specific tests (microbenchmarks) to time every single step of the delivery process.

How They Did It (The Analogy):

For the NVIDIA Truck (Blackwell):
They realized this truck has a very specific, assembly-line style. It has a special "loading dock" (called TMEM) and a "bulk loader" (called TMA) that moves things automatically.
- The Model: They built a step-by-step stopwatch. "Step 1: Load data (takes 420 nanoseconds). Step 2: Move to the special dock. Step 3: Process the math. Step 4: Sync with the other truck."
- Result: They predicted the time with 1.3% error. That's like predicting a 10-minute trip and being off by only 8 seconds.

For the AMD Truck (MI300A):
This truck is different. It has a massive "warehouse" right next to the driver (called Infinity Cache) and the driver has to manage their own seat space (registers).
- The Model: They created a formula that asks: "Is the package small enough to fit in the warehouse? If yes, it's super fast. If no, it has to go to the slow highway." They also checked how crowded the driver's seat is (occupancy).
- Result: They predicted the time with 0.09% error once the model was calibrated against measurements taken on the machine itself (about 5–8% error without that calibration).
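
The "does it fit in the warehouse?" question above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' formula: the names `MemoryLevel` and `serving_level` are made up for this example, the bandwidth values are placeholders rather than measured numbers, and a hard fits/doesn't-fit cutoff is a simplification (real caches degrade gradually as the working set grows). The 256 MB capacity is the Infinity Cache size quoted in the technical summary.

```python
from dataclasses import dataclass


@dataclass
class MemoryLevel:
    name: str
    capacity_bytes: float  # largest working set that still fits in this level
    bandwidth: float       # sustained bytes/second (placeholder values below)


def serving_level(working_set_bytes, levels):
    """Return the first (fastest) level whose capacity holds the working set."""
    for level in levels:
        if working_set_bytes <= level.capacity_bytes:
            return level
    return levels[-1]  # nothing fits: fall back to the last (slowest) level


MB = 1024 ** 2
levels = [
    MemoryLevel("infinity_cache", 256 * MB, 4.0e12),  # bandwidth is a placeholder
    MemoryLevel("hbm", float("inf"), 2.0e12),         # bandwidth is a placeholder
]

small = serving_level(64 * MB, levels)
large = serving_level(1024 * MB, levels)
print(small.name, large.name)  # prints "infinity_cache hbm"
```

A package that fits in the warehouse is served at the warehouse's speed; anything larger falls through to the highway.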

Why This Matters:
The authors tested their new models on real-world jobs (like complex math problems used in science and AI).

The old "Roofline" method was wrong almost every time (off by nearly 100%).
Their new method stayed close on regular, predictable jobs and drifted further from the real time on irregular ones.

The "Plug-and-Play" Feature:
The coolest part is that they didn't have to invent a whole new system for older trucks (like the NVIDIA H200 or AMD MI250X). They took their existing model, swapped in the older trucks' "speed limit" and "warehouse size" numbers, and the same formulas carried over. It's like having a GPS app that works for a Ford, a Toyota, and a Tesla just by changing the car model in the settings, without needing to rewrite the map. One caveat: when the description of the job itself was not re-measured for the older truck, predictions for full real-world jobs got noticeably worse.

The Catch (Limitations):
The model works great when the "delivery" is smooth and predictable (like moving a big block of data). If the delivery involves zig-zagging through a maze (irregular data) or stopping for tiny, split-second tasks, the model gets a little less accurate. Also, the model relies on someone telling it exactly how much data is being moved; if that input is wrong, the prediction will be wrong.

In Summary:
The authors built a "smart GPS" for modern supercomputers. Instead of guessing based on marketing brochures, they measured the actual behavior of the hardware. This allows engineers to predict how long a regular, predictable task will take on these new machines to within a few percent, something the old methods couldn't do. They promise to share all their tools and measurements with the public so everyone can use them.

Technical Summary: Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

Problem Statement
Modern High-Performance Computing (HPC) and AI systems rely on rapidly evolving GPU architectures (e.g., NVIDIA Blackwell B200 and AMD CDNA3 MI300A) featuring complex memory hierarchies, specialized matrix units, and varied precision formats. A significant gap exists between theoretical peak performance and achievable efficiency. Traditional performance modeling, specifically the "naive roofline" model, fails to accurately predict execution times on these modern accelerators. The authors argue that the naive roofline approach, which relies on a single maximum function of compute and memory bounds using datasheet peaks, ignores critical architectural realities: serialized pipeline stages, dedicated matrix paths, Tensor Memory (TMEM) residency, and occupancy-driven constraints. Consequently, naive roofline baselines exhibit errors exceeding 95% on modern kernels, rendering them ineffective for performance engineering and optimization.
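
For reference, the naive roofline estimate criticized above amounts to a single `max` over two bounds. The sketch below shows that baseline only; the function name and the input numbers are illustrative assumptions, chosen to make the arithmetic easy to follow, and are not taken from the paper.

```python
def naive_roofline_time(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Naive roofline: the slower of the compute bound and the memory bound."""
    compute_time = flops / peak_flops            # seconds if purely compute-bound
    memory_time = bytes_moved / peak_bandwidth   # seconds if purely memory-bound
    return max(compute_time, memory_time)


# Placeholder workload and datasheet-style peaks (not real device figures).
t = naive_roofline_time(flops=2e9, bytes_moved=4e9,
                        peak_flops=1e12, peak_bandwidth=1e12)
print(f"{t * 1e3:.1f} ms")  # prints "4.0 ms" (memory bound: 4e9 / 1e12 s)
```

Nothing in this formula can express a stage that must finish before the next one starts, or a cache that changes the effective bandwidth, which is the gap the paper's models are meant to close.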

Methodology
The paper proposes a systematic, microbenchmark-driven approach to construct analytical performance models for two current-generation architectures: NVIDIA Blackwell (B200) and AMD CDNA3 (MI300A).

Microbenchmark Characterization: The authors first characterize the hardware using targeted low-level microbenchmarks. These measurements derive model parameters directly from the hardware, including sustained bandwidths (HBM, TMEM, Infinity Cache), instruction latencies (TMA, tensor cores, barriers), and occupancy limits. This contrasts with relying solely on vendor datasheet peaks, which often overstate achievable throughput.
Stage-Centric and Wavefront-Centric Modeling:
- NVIDIA Blackwell (B200): The model adopts a stage-centric framework, explicitly modeling the pipeline stages: Tensor Memory Accelerator (TMA) $\rightarrow$ Tensor Memory (TMEM) $\rightarrow$ 5th-generation Tensor Cores $\rightarrow$ Synchronization. It accounts for asynchronous bulk copy, TMEM capacity constraints (256 KB/SM), decompression engines, and 2-SM cooperative execution.
- AMD CDNA3 (MI300A): The model utilizes a wavefront-centric framework, focusing on implicit overlap driven by occupancy. It incorporates the Infinity Cache hierarchy (256 MB), Vector General Purpose Register (VGPR) constraints, and the trade-off between tile size and occupancy. It models the L1/L2/Infinity Cache/HBM memory hierarchy and the impact of working set size on cache hit rates.
Validation Strategy: The models are validated against a suite of 21 microbenchmarks for B200 and 27 for MI300A. Furthermore, they are tested on full application benchmarks from Rodinia 3.1 and SPEChpc 2021 Tiny. The authors also demonstrate portability by applying the same model frameworks to the previous generation of each vendor (NVIDIA H200 and AMD MI250X) simply by updating hardware parameters, without re-deriving the model formulas.
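
A minimal sketch of the stage-centric idea, assuming the stages run strictly one after another so their times add up. The paper's actual equations are not reproduced in this summary, so the function name `stage_centric_time` and every stage time except the 420 ns load figure quoted in the plain-language section are placeholders for illustration.

```python
def stage_centric_time(stage_times_ns):
    """Serialized pipeline: total time is the sum of the per-stage times."""
    return sum(stage_times_ns.values())


stages = {
    "tma_load": 420.0,     # load figure quoted in the plain-language section
    "tmem_move": 100.0,    # placeholder
    "tensor_core": 300.0,  # placeholder
    "sync": 50.0,          # placeholder
}

serialized = stage_centric_time(stages)
single_bound = max(stages.values())  # what a single-max model would report
print(serialized, single_bound)      # prints "870.0 420.0"
```

The difference between the two printed numbers is the kind of time a single-`max` model cannot account for when stages do not overlap.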

Key Contributions

First Validated Execution-Time Models: The paper presents, to the authors' knowledge, the first validated analytical execution-time models specifically for the NVIDIA Blackwell (B200) and AMD CDNA3 (MI300A) architectures.
Novel Architectural Terms: The models introduce specific terms to capture modern features previously ignored by analytical models, including TMEM/TMA interactions on Blackwell and Infinity Cache hierarchy/VGPR pressure on CDNA3.
Cross-Vendor Validation: The work provides a unified validation protocol across competing vendors, reporting Mean Absolute Error (MAE) under shared conditions.
Portability Demonstration: The authors demonstrate that the model frameworks are extensible. By updating parameters (e.g., bandwidth, cache size) derived from microbenchmarks, the models successfully predict performance on H200 and MI250X without structural changes.
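
The portability claim can be pictured as one fixed model function fed with different parameter sets. In the sketch below, `HardwareParams`, `predict_time`, the device names, and all numbers are hypothetical, and the model body (compute time plus memory time) is a deliberately simple stand-in rather than the paper's formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareParams:
    name: str
    sustained_flops: float      # FLOP/second, as a microbenchmark would measure
    sustained_bandwidth: float  # bytes/second, as a microbenchmark would measure


def predict_time(flops, bytes_moved, hw):
    """Fixed model structure; only the parameter set differs between devices."""
    return flops / hw.sustained_flops + bytes_moved / hw.sustained_bandwidth


# Placeholder parameter sets for two hypothetical device generations.
newer = HardwareParams("newer_gpu", sustained_flops=1.0e14, sustained_bandwidth=2.0e12)
older = HardwareParams("older_gpu", sustained_flops=5.0e13, sustained_bandwidth=1.0e12)

t_newer = predict_time(1e12, 1e10, newer)
t_older = predict_time(1e12, 1e10, older)
print(f"{t_newer:.3f} s vs {t_older:.3f} s")  # prints "0.015 s vs 0.030 s"
```

Keeping the parameters in one immutable record makes "porting" the model a matter of re-running the microbenchmarks and filling in a new record.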

Results

Microbenchmark Accuracy: The proposed models achieve high accuracy on microbenchmarks.
- Blackwell (B200): 1.31% MAE across 21 kernels.
- CDNA3 (MI300A): ~0.09% MAE across 27 kernels (achieved with host-measured calibration multipliers; uncalibrated models yield ~5–8% MAE).
- Comparison: In contrast, naive roofline baselines using only datasheet peaks exceed 95% error on the same kernels (e.g., 96.1% on B200, 99.6% on MI300A).
Application Benchmarks:
- Rodinia 3.1: On MI300A, the model achieves 12.5% MAE overall, with near-zero error on regular workloads (e.g., pathfinder, srad) and higher error on irregular access patterns (e.g., bfs, hotspot).
- SPEChpc 2021 Tiny: On MI300A, the model achieves 1.3% MAE when using profiler-derived FLOP/byte counts. However, when using first-principles (source-code) analysis, error rises to ~92.5%, highlighting a discrepancy between compiler-generated kernels and source-level algorithm analysis rather than a failure of the performance model itself.
Portability: When applied to H200 and MI250X without re-characterization of the workload segments, application-level MAE increases (e.g., H200 Rodinia 43.6%), confirming that while the model structure is portable, accurate workload characterization remains platform-specific.
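
The error figures above can be read with the following sketch in mind. It assumes that the reported MAE is the mean of per-kernel absolute errors expressed as a percentage of the measured time, and that a calibration multiplier is a single scale factor fitted from measured-to-predicted ratios; the summary does not spell out either definition, so both are assumptions, and the timing values are made up.

```python
def mean_absolute_error_pct(predicted, measured):
    """Mean of |predicted - measured| / measured, as a percentage."""
    errors = [abs(p - m) / m * 100.0 for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


def calibration_multiplier(predicted, measured):
    """One scale factor: the mean ratio of measured to predicted time."""
    ratios = [m / p for p, m in zip(predicted, measured)]
    return sum(ratios) / len(ratios)


# Made-up timings (any consistent unit) in which the model runs 10% fast.
predicted = [90.0, 180.0, 450.0]
measured = [100.0, 200.0, 500.0]

raw_mae = mean_absolute_error_pct(predicted, measured)
k = calibration_multiplier(predicted, measured)
calibrated_mae = mean_absolute_error_pct([p * k for p in predicted], measured)
print(f"raw {raw_mae:.1f}%, calibrated {calibrated_mae:.1f}%")
```

In this toy case a single multiplier removes a systematic bias entirely, which is one way a calibrated model can report a much lower MAE than its uncalibrated form.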

Significance and Claims
The paper claims that architecture-specific analytical modeling is necessary to bridge the gap between theoretical peaks and actual performance on modern GPUs. The authors emphasize that the "naive roofline" is insufficient because it cannot represent serialized pipeline stages (Blackwell) or occupancy-driven cache hierarchies (CDNA3).

The significance of this work lies in its ability to provide interpretable, parameterized models that accurately predict execution time within 1–5% MAE for microbenchmarks and regular applications. The authors assert that their approach shifts the bottleneck from model formulation to workload characterization. They note that while the models are highly accurate for regular, data-parallel workloads, they face limitations with irregular access patterns (e.g., sparse matrices, pointer chasing) and very short kernels where launch overhead dominates.
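
To see why very short kernels are hard for a model of the kernel body alone, consider the share of end-to-end time taken by launch overhead. The sketch below is illustrative only: `overhead_fraction` is a made-up helper, and the 5 µs overhead is a placeholder, not a measured value from the paper.

```python
def overhead_fraction(kernel_time_us, launch_overhead_us):
    """Share of end-to-end time spent on launch overhead rather than kernel work."""
    return launch_overhead_us / (launch_overhead_us + kernel_time_us)


LAUNCH_OVERHEAD_US = 5.0  # placeholder value for illustration

short_kernel = overhead_fraction(2.0, LAUNCH_OVERHEAD_US)
long_kernel = overhead_fraction(1000.0, LAUNCH_OVERHEAD_US)
print(f"{short_kernel:.0%} vs {long_kernel:.1%}")
```

When the fraction is large, even a perfect prediction of the kernel body misses most of the measured time.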

The paper concludes that these models enable practical applications such as procurement comparisons between vendors without physical access, autotuning guidance for tile sizes and precision, and rapid performance estimation on new hardware by simply running microbenchmarks to update parameters. The authors also highlight that existing benchmark suites (like Rodinia) may not fully exercise modern primitives like TMA or TMEM, suggesting a need for new benchmarks that directly target these features.
