GCL-Sampler: Discovering Kernel Similarity for Sampled GPU Simulation via Graph Contrastive Learning

GCL-Sampler is a novel GPU workload sampling framework that utilizes Relational Graph Convolutional Networks and contrastive learning to automatically extract high-dimensional kernel similarities from trace graphs, achieving significantly higher speedups (258.94x) with minimal error (0.37%) compared to existing state-of-the-art methods.

Jiaqi Wang, Jingwei Sun, Jiyu Luo, Han Li, Guangzhong Sun

Published 2026-03-03

Imagine you are an architect trying to design a new, super-fast car engine. To make sure your design works, you need to run thousands of simulations. But here's the problem: running a full simulation of the engine is like trying to drive the car at full speed on a test track that is one million miles long. It would take you weeks or even months to finish just one test. You'd never get any new designs built!

This is exactly the problem computer scientists face with GPU simulators. GPUs (the chips in your graphics cards that power AI and video games) are incredibly complex. Simulating them perfectly is so slow that researchers can't test new ideas fast enough.

The Old Way: Guessing and Checking

To speed things up, researchers used to try "sampling." Instead of driving the whole million-mile track, they'd pick a few short segments to test and assume the rest of the track is similar.

But the old methods were like a clumsy detective:

  1. The "Name-Tag" Detective: Some methods only looked at the name of the task. If two tasks had different names, they assumed they were totally different, even if they drove the engine exactly the same way. This meant they had to test almost everything, so they didn't save much time.
  2. The "Counting" Detective: Others just counted how many instructions a task had. But two tasks can have the same number of instructions but behave completely differently (like two people walking the same number of steps but one is sprinting and the other is dancing). This led to bad guesses and wrong results.

The New Solution: GCL-Sampler (The "Super-Intuitive" Detective)

The authors of this paper, Jiaqi Wang and his team, built a new tool called GCL-Sampler. Think of it as a detective with a superpower: Pattern Recognition.

Instead of looking at names or simple counts, GCL-Sampler looks at the entire story of how the GPU works.

1. Turning Code into a Map (The Graph)

Imagine every instruction the GPU runs is a city, and the data it moves between instructions are the roads connecting them.

  • Old methods just looked at the city names.
  • GCL-Sampler builds a rich, detailed map (a "Graph") showing every road, every traffic light, and every turn. It captures the structure and the meaning of the code, not just the surface details.
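To make the analogy concrete, here is a minimal sketch (in Python, with made-up instruction names and relation types — not the paper's actual trace format) of a kernel trace represented as a typed graph. Nodes are instructions and typed edges capture different relations, which is the kind of relational structure a Relational Graph Convolutional Network consumes:

```python
# Illustrative only: a tiny kernel trace as a graph. Nodes are
# instructions; each edge carries a relation type (data dependence
# vs. program order), since R-GCNs aggregate messages per relation.

trace_graph = {
    "nodes": {
        0: {"opcode": "LDG"},   # load from global memory
        1: {"opcode": "FMA"},   # fused multiply-add
        2: {"opcode": "STG"},   # store to global memory
    },
    "edges": [
        # (src, dst, relation)
        (0, 1, "data"),     # the FMA consumes the loaded value
        (1, 2, "data"),     # the store writes the FMA result
        (0, 1, "control"),  # program order
        (1, 2, "control"),
    ],
}

def edges_by_relation(graph, relation):
    """Return the edges of one relation type."""
    return [(s, d) for s, d, r in graph["edges"] if r == relation]
```

Separating edges by relation type is what lets the network treat "these two instructions share data" differently from "these two instructions merely run in sequence."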

2. The "Contrastive Learning" Gym

Now, how does the computer learn to recognize similar maps?
Think of it as training in a gym. You show the computer two slightly different photos of the same city (maybe one has a tree missing, or the lighting is different).

  • The computer learns: "Hey, even though these photos look slightly different, they are the same city!"
  • Then, you show it a photo of a totally different city (like a desert vs. a jungle).
  • The computer learns: "These are totally different."

This is called Contrastive Learning. The computer trains itself to ignore the small, unimportant details (noise) and focus on the deep, structural similarities. It learns to say, "These two GPU tasks are twins, even if they have different names!"
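The training idea above can be sketched with a simplified InfoNCE-style loss, a common contrastive objective. Everything here — vector sizes, temperature, the noise model — is illustrative, not the paper's implementation:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """Simplified InfoNCE: pull two 'photos' of the same kernel
    together, push embeddings of different kernels apart."""
    pos = np.exp(cosine(anchor, positive) / temperature)
    neg = sum(np.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -np.log(pos / (pos + neg))

# Two lightly perturbed views of the same kernel embedding should
# yield a lower loss than pairing it with an unrelated kernel.
rng = np.random.default_rng(0)
z = rng.normal(size=8)
view_a = z + 0.01 * rng.normal(size=8)
view_b = z + 0.01 * rng.normal(size=8)
other = rng.normal(size=8)

loss_similar = contrastive_loss(view_a, view_b, [other])
loss_dissimilar = contrastive_loss(view_a, other, [view_b])
```

Minimizing this loss is exactly the "same city, different photo" training: the model is rewarded for mapping augmented views of one kernel close together and unrelated kernels far apart.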

3. The Result: The Perfect Shortcut

Once the computer has learned this, it groups thousands of GPU tasks into "families" based on how they actually behave.

  • Instead of testing 10,000 different tasks, it picks one representative from each family.
  • It simulates just that one, and then mathematically scales the result to represent the whole family.
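The pick-one-and-scale step boils down to simple weighted arithmetic. A hedged sketch (all numbers and helper names are hypothetical, chosen purely to show the scaling):

```python
def sampled_estimate(kernel_times, labels, representatives):
    """Estimate total runtime by simulating one representative per
    'family' (cluster) and scaling its time by the family size."""
    total = 0.0
    for family in set(labels):
        size = sum(1 for l in labels if l == family)
        total += kernel_times[representatives[family]] * size
    return total

# Hypothetical example: 6 kernels falling into 2 behavioural families.
times = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]   # per-kernel simulated times
labels = [0, 0, 0, 1, 1, 1]              # family assignment from clustering
reps = {0: 0, 1: 3}                      # one representative per family

estimate = sampled_estimate(times, labels, reps)   # 3*1.0 + 3*5.0 = 18.0
full = sum(times)                                  # 18.0
```

In this toy case only 2 of the 6 kernels are simulated, yet the scaled estimate matches the full total — the quality of the estimate depends entirely on how behaviorally uniform each family really is, which is what the learned embeddings are for.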

The Magic Numbers

The paper shows that this new "Super-Intuitive Detective" is a game-changer:

  • Speed: It makes the simulation about 259 times faster. (Imagine finishing a year-long project in under two days!)
  • Accuracy: It is 99.6% accurate (just 0.37% error). The old methods were either fast but inaccurate (around 20% error) or accurate but slow. GCL-Sampler gets the best of both worlds.
  • Real World: They tested it on everything from scientific math problems to massive AI models (like the ones powering chatbots), and it worked great on different generations of computer chips.

The Bottom Line

GCL-Sampler is like having a time machine for computer architects. By using advanced AI to understand the "soul" of the code rather than just its "clothes" (names or counts), it allows researchers to skip the boring, repetitive parts of testing and focus on the important stuff. It turns a process that used to take weeks into something that takes minutes, without losing any accuracy.
