GPU-Fuzz: Finding Memory Errors in Deep Learning Frameworks

Imagine you are a master chef running a massive, high-tech kitchen (this is your Deep Learning Framework, like PyTorch or TensorFlow). You have hundreds of specialized robots (the GPUs) doing the heavy lifting, chopping, and cooking at lightning speed.

For years, food critics (security researchers) have been checking if the robots are following the recipe correctly. They ask: "Did the robot chop the onions into the right size? Is the soup the right temperature?" This is like checking if the math is right.

But there's a hidden danger the critics missed: The robots might be cutting their own fingers or knocking over the spice rack while they work.

This paper introduces GPU-Fuzz, a new kind of inspector designed specifically to catch these "finger-cutting" accidents, which are called memory errors.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Silent" Crash

In a normal kitchen, if a robot drops a pot, you hear a crash. But in the world of AI, a robot can knock over a jar of poison (corrupt memory) and keep cooking. The soup tastes fine, but it's actually toxic.

The Issue: These errors happen deep inside the robot's brain (the CUDA kernel). They occur when the robot tries to grab an ingredient from a shelf that doesn't exist (out-of-bounds access) or grabs the wrong jar because the label was blurry (misaligned memory).
The Danger: This can cause the whole kitchen to catch fire (crash) or, worse, let a hacker sneak in and steal your secret recipes (security breach).

2. The Old Way: Guessing the Menu

Previous inspectors (like NNSmith) tried to find bugs by ordering thousands of different menus (neural networks). They would say, "Let's try a menu with 100 layers of lasagna!" or "Let's try a menu with 500 tiny dumplings!"

The Flaw: This is like testing a kitchen by changing the menu, but never checking if the robot knows how to hold a knife for a specific, weirdly shaped vegetable. The bugs aren't in the menu; they are in how the robot handles a specific ingredient size.

3. The New Way: The "Rulebook" Inspector (GPU-Fuzz)

GPU-Fuzz changes the game. Instead of guessing random menus, it acts like a strict Rulebook Inspector.

Step 1: The Rulebook (Modeling): The researchers wrote down the strict laws of physics for every robot task. For example, "If you are chopping a carrot that is 10 inches long, your knife can only go so far." They turned these laws into math equations.
Step 2: The Magic Calculator (Constraint Solver): They used a super-smart calculator (called Z3) to solve these equations. But instead of just finding one answer, they told the calculator: "Find me the weirdest, most extreme answers possible!"
- Analogy: Imagine asking a calculator: "Give me a number bigger than 10." It says "11."
- GPU-Fuzz says: "No, give me a number bigger than 10, but not 11, and not 12, and make it one that lands far away from anything you have already given me." It forces the calculator to dig deep into the "weird corners" of the math.
Step 3: The Stress Test: The system takes these weird, extreme numbers (like a stride of 200 or a kernel size of 5) and feeds them to the robots in PyTorch, TensorFlow, and PaddlePaddle.
Step 4: The Safety Net: They use a tool called Compute-Sanitizer (like a high-speed camera) to watch the robots. If a robot even thinks about grabbing a jar from the wrong shelf, the camera flashes red and stops the robot.

4. The Results: 13 Hidden Bombs

By using this "Rulebook" approach, GPU-Fuzz found 13 previously unknown bugs in the world's most popular AI kitchens.

Some were Silent Corruptions: The robot grabbed the wrong ingredient, the soup tasted weird, but no one noticed until someone got sick.
Some were Explosions: The robot tried to do math that was too big for its brain, causing the whole system to freeze.

The Big Takeaway

Think of it this way:

Old Inspectors checked if the recipe made sense.
GPU-Fuzz checks if the chef's hands are safe when handling specific, weirdly shaped ingredients.

The authors realized that to keep AI safe, we can't just look at the big picture (the neural network); we have to zoom in and stress-test the tiny, specific rules that govern how the computer memory is touched.

In short: GPU-Fuzz is a specialized tool that uses math to force AI systems to try the most impossible, weird combinations of settings, revealing hidden cracks in the foundation that were previously invisible.

1. Problem Statement

Deep Learning (DL) frameworks (e.g., PyTorch, TensorFlow, PaddlePaddle) rely heavily on GPUs for performance. However, the low-level CUDA kernels implementing these frameworks are prone to memory errors (e.g., out-of-bounds access, misaligned writes, race conditions). These errors can lead to:

System Crashes: Disrupting critical applications like autonomous driving or medical imaging.
Silent Data Corruption: Producing incorrect results without triggering errors, which is particularly dangerous for AI reliability.
Security Vulnerabilities: Exploitable via Return-Oriented Programming (ROP) or code tampering due to GPU memory layout characteristics (lack of W⊕X permissions).

The Gap: Existing fuzzers for DL systems (e.g., NNSmith) focus on generating diverse neural network structures to find compiler-level arithmetic errors or numerical inconsistencies. They fail to systematically explore the operator parameter space (e.g., tensor shapes, strides, padding, dilation) where low-level memory bugs reside. Consequently, boundary-value memory errors in CUDA kernels remain largely undetected.

2. Methodology: GPU-Fuzz

GPU-Fuzz is a constraint-guided fuzzer designed specifically to target memory errors at the operator level. Its architecture consists of three main phases:

A. Operator Modeling

Instead of modeling entire neural networks, GPU-Fuzz abstracts individual DL operators (e.g., Convolution, Pooling, Padding) into formal constraint models.

Symbolic Variables: Parameters like input size ($H_{in}$), kernel size ($K$), stride ($S$), padding ($P$), and dilation ($D$) are treated as symbolic variables.
Constraint Extraction: The system encodes the semantic and mathematical rules of each operator into formal constraints (e.g., $H_{out} = \lfloor \frac{H_{in} + 2P - D(K-1) - 1}{S} \rfloor + 1$).
Manual Verification: The authors manually extracted and cross-verified constraints for 13 operator families to ensure semantic correctness.
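
As a rough illustration of what such an operator model captures, the output-size rule above can be written as a small validity check. This is a minimal sketch, not the authors' code: the `ConvParams` container, the `is_valid` helper, and the specific validity rules (positive sizes, non-negative padding, at least one output element) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvParams:
    """Hypothetical container for one spatial dimension of a convolution."""
    h_in: int      # input size H_in
    kernel: int    # kernel size K
    stride: int    # stride S
    padding: int   # padding P
    dilation: int  # dilation D


def conv_output_size(p: ConvParams) -> int:
    """H_out = floor((H_in + 2P - D(K-1) - 1) / S) + 1."""
    numerator = p.h_in + 2 * p.padding - p.dilation * (p.kernel - 1) - 1
    return numerator // p.stride + 1  # // is floor division in Python


def is_valid(p: ConvParams) -> bool:
    """Assumed rules: positive sizes, non-negative padding, H_out >= 1."""
    if min(p.h_in, p.kernel, p.stride, p.dilation) < 1 or p.padding < 0:
        return False
    return conv_output_size(p) >= 1
```

A constraint-guided fuzzer is interested in configurations that sit right on the edge of this rule, such as a kernel that exactly covers the input (`ConvParams(5, 5, 1, 0, 1)`) or a stride far larger than the input (`ConvParams(10, 5, 200, 0, 1)`).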

B. Constraint-Based Test Case Generation

GPU-Fuzz uses an SMT solver (Z3) to generate test cases that specifically probe boundary conditions.

Iterative Search: Unlike standard solvers that return a single solution, GPU-Fuzz employs an iterative constraint-guided search. It starts with a valid solution, then randomly selects a parameter to exclude its current value (e.g., $\mathit{stride} \neq 10$) and re-solves.
Hash-Based Diversity: To prevent the solver from returning similar values (e.g., $\mathit{stride}=11$ vs $\mathit{stride}=10$), the system adds hash-based constraints (e.g., $h(\mathit{stride}) \neq h(10)$). This forces the solver to explore distinct regions of the parameter space, maximizing coverage of edge cases.
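
The loop described above can be sketched in plain Python. To keep the example dependency-free, a shuffled brute-force search over a small domain stands in for the Z3 solver, so this shows the shape of the exclusion-and-hash loop rather than the paper's implementation. The hash function `h`, the bucket count, the `DOMAIN` values, and all helper names are assumptions.

```python
import itertools
import random


def h(value: int) -> int:
    """Toy multiplicative hash into 8 buckets (an assumption, not the paper's hash)."""
    return ((value * 2654435761) % 2**32) >> 29


def valid_conv(c: dict) -> bool:
    """Base constraint: the convolution (dilation fixed at 1) yields H_out >= 1."""
    return (c["h_in"] + 2 * c["padding"] - (c["kernel"] - 1) - 1) // c["stride"] + 1 >= 1


def solve(domain: dict, constraints: list, rng: random.Random):
    """Brute-force stand-in for an SMT solver: one satisfying assignment or None."""
    names = list(domain)
    candidates = [dict(zip(names, vals)) for vals in itertools.product(*domain.values())]
    rng.shuffle(candidates)
    for candidate in candidates:
        if all(check(candidate) for check in constraints):
            return candidate
    return None


def iterative_search(domain: dict, base_constraints: list, rounds: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    constraints = list(base_constraints)
    found = []
    for _ in range(rounds):
        solution = solve(domain, constraints, rng)
        if solution is None:  # the accumulated exclusions left nothing to explore
            break
        found.append(solution)
        name = rng.choice(list(domain))  # randomly pick a parameter to push away
        old = solution[name]
        # Exclude the current value (e.g. stride != 10) ...
        constraints.append(lambda c, n=name, v=old: c[n] != v)
        # ... and every value sharing its hash bucket (h(stride) != h(10)).
        constraints.append(lambda c, n=name, v=old: h(c[n]) != h(v))
    return found


DOMAIN = {
    "h_in": list(range(1, 65)),
    "kernel": list(range(1, 8)),
    "stride": [1, 2, 3, 5, 10, 11, 50, 200],
    "padding": list(range(0, 4)),
}
```

Calling `iterative_search(DOMAIN, [valid_conv], rounds=5)` returns up to five valid configurations. In this toy the hash constraint already implies the plain exclusion; both are kept to mirror the two steps described above.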

C. Cross-Framework Execution

Translation: Generated abstract parameters are translated into concrete API calls for multiple frameworks (PyTorch, TensorFlow, PaddlePaddle).
Runtime Analysis: Each execution is wrapped with NVIDIA's compute-sanitizer, a tool that monitors GPU memory accesses at run time and reports out-of-bounds reads/writes, misaligned accesses, and other memory violations that the framework API itself would not surface.
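
The two steps above might look roughly like the following. This is a sketch under stated assumptions: the parameter names, the script template, and the helper functions are hypothetical, and the paper does not say which compute-sanitizer options it uses. The `--tool memcheck` and `--error-exitcode` options are taken from the compute-sanitizer command line as documented by NVIDIA.

```python
import subprocess
import sys
from typing import List, Mapping, Tuple


def to_pytorch_script(params: Mapping[str, int]) -> str:
    """Render abstract parameters as a small PyTorch program (hypothetical template)."""
    return (
        "import torch\n"
        f"op = torch.nn.Conv2d({params['c_in']}, {params['c_out']}, "
        f"kernel_size={params['kernel']}, stride={params['stride']}, "
        f"padding={params['padding']}, dilation={params['dilation']}).cuda()\n"
        f"x = torch.randn(1, {params['c_in']}, {params['h_in']}, {params['h_in']}, "
        "device='cuda')\n"
        "y = op(x)\n"
        "torch.cuda.synchronize()  # make sure the kernel has actually run\n"
    )


def sanitizer_command(script_path: str) -> List[str]:
    """Wrap the test script so memcheck watches every GPU memory access."""
    return [
        "compute-sanitizer",
        "--tool", "memcheck",
        "--error-exitcode", "1",  # non-zero exit status when memcheck reports an error
        sys.executable,
        script_path,
    ]


def run_under_sanitizer(script_path: str, timeout_s: int = 300) -> Tuple[bool, str]:
    """Return (failed, log). A non-zero exit can also mean the script itself raised."""
    proc = subprocess.run(
        sanitizer_command(script_path),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode != 0, proc.stdout + proc.stderr
```

Because a non-zero exit status also covers ordinary Python exceptions, a real harness would need to inspect the sanitizer log to tell memory violations apart from rejected configurations.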

3. Key Contributions

Novel Fuzzing Paradigm: Shifts the focus from network-structure fuzzing to operator-parameter fuzzing, addressing a blind spot in existing DL security research.
System Implementation: Developed GPU-Fuzz, a system that combines formal constraint modeling with SMT solving to systematically explore the parameter space of 13 major operator families.
Discovery of Critical Bugs: Uncovered 13 previously unknown bugs in major frameworks (PyTorch, TensorFlow, PaddlePaddle), including 7 distinct memory access violations.
Demonstration of Silent Errors: Highlighted that many critical bugs are "silent" (no API crash) and only detectable via low-level memory debuggers, emphasizing the need for specialized testing tools.

4. Results and Evaluation

The authors evaluated GPU-Fuzz against the state-of-the-art fuzzer NNSmith on a server with an NVIDIA H100 GPU.

Bug Discovery:
- Total: 13 unique bugs found across three frameworks.
- Types: Included Out-of-Bounds (OOB) writes/reads, misaligned writes, integer overflows in grid dimension calculations, and invalid launch configurations.
- Severity: 5 bugs were silent memory corruptions (no crash, but data corruption), which are the most dangerous type of vulnerability.
Comparative Performance (vs. NNSmith):
- Test Case Generation: GPU-Fuzz generated ~51,860 test cases (roughly 2.7x as many as NNSmith's ~19,000) in the same timeframe.
- Bug Relevance: NNSmith primarily found numerical inconsistencies (293 bugs) but zero memory errors. GPU-Fuzz found 26 ± 5 critical memory errors and 80 configuration errors, with zero numerical inconsistencies.
- Conclusion: The two tools are complementary; NNSmith covers compiler/numerical issues, while GPU-Fuzz covers low-level memory safety.

Case Study Example

The paper details a bug in PyTorch's ConvTranspose2d.

Trigger: A specific combination of large stride (200) and input dimensions generated by the fuzzer.
Root Cause: An integer overflow in the C++ host code where a 64-bit element count was cast to a 32-bit integer. This resulted in an undersized CUDA grid dimension.
Consequence: Threads calculated indices exceeding the allocated buffer, causing an out-of-bounds write. This was a silent corruption that compute-sanitizer detected but the Python API did not report.
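
The arithmetic behind this kind of bug can be illustrated without a GPU. The sketch below does not reproduce PyTorch's host code: the tensor shape, the block size of 1024 threads, and the helper names are assumptions chosen so that the element count exceeds the 32-bit range. The output-size formula is the one given in the PyTorch documentation for ConvTranspose2d.

```python
def to_int32(value: int) -> int:
    """Emulate casting a 64-bit integer to int32_t (typical two's-complement wraparound)."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def conv_transpose_out(h_in: int, kernel: int, stride: int,
                       padding: int = 0, dilation: int = 1, output_padding: int = 0) -> int:
    """Output size along one spatial dimension of a transposed convolution."""
    return (h_in - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


def blocks_needed(numel: int, threads_per_block: int = 1024) -> int:
    """Number of thread blocks needed to cover numel elements, one element per thread."""
    return (numel + threads_per_block - 1) // threads_per_block


# Hypothetical configuration: batch 1, 2 output channels, 256x256 input, stride 200.
h_out = conv_transpose_out(h_in=256, kernel=3, stride=200)
numel_64 = 1 * 2 * h_out * h_out   # true element count, needs more than 32 bits
numel_32 = to_int32(numel_64)      # what survives the narrowing cast

grid_expected = blocks_needed(numel_64)
grid_actual = blocks_needed(numel_32)  # far fewer blocks than the tensor requires
```

With these values the truncated count is still positive, so nothing looks wrong on the host side, yet the launch configuration no longer matches the tensor the kernel is meant to process.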

5. Significance

Security Impact: GPU memory errors can compromise the integrity of AI systems, leading to silent data corruption or security exploits (e.g., ROP attacks). GPU-Fuzz provides a mechanism to detect these before deployment.
Methodological Shift: The paper argues that securing modern AI requires a dual approach: testing network models (for compiler bugs) AND testing operator parameters (for kernel memory bugs).
Practical Utility: By identifying 13 bugs in widely used frameworks, many of which were confirmed or fixed by developers, the work demonstrates the immediate value of constraint-guided fuzzing for the DL ecosystem.

Limitations & Future Work:

Manual Effort: Modeling constraints requires significant manual effort (approx. 100-150 lines of code per operator family). Future work aims to automate constraint extraction from documentation.
Oracle Limitations: The current oracle (compute-sanitizer) detects memory errors but not silent numerical correctness issues, suggesting a need for differential fuzzing against CPU implementations in the future.
