Imagine you are running a massive, high-speed factory (a modern AI model) that needs to process millions of items every second. The factory floor is the GPU (the graphics card), and the workers are tiny programs called CUDA kernels that do the actual heavy lifting.
For years, building these workers has been like hiring a master carpenter to hand-carve every single tool. It's incredibly hard, requires years of specialized training, and if you get it wrong, the factory slows down to a crawl.
Enter CUDA Agent. Think of it not as a carpenter, but as a super-intelligent, tireless apprentice who learns by doing, failing, and trying again, until they become the best worker in the world.
Here is the story of how this apprentice was built, explained in simple terms.
1. The Problem: The "Magic Box" vs. The "Human Expert"
Currently, we have two ways to make these factory workers:
- The Magic Box (torch.compile): This is an automatic tool that tries to arrange the workers efficiently. It's good, but it's rigid. It follows a rulebook and can't think outside the box.
- The Human Expert: A real programmer who knows exactly how the factory floor works. They can build a worker that is 10x faster than the Magic Box, but such experts are rare, expensive, and slow to hire.
Large Language Models (LLMs)—the AI chatbots you know—have tried to learn this job. But they usually fail. They can write code that looks right, but when they try to run it in the factory, it's either broken or slower than the Magic Box. They lack the "muscle memory" for hardware optimization.
2. The Solution: The "Gym" for AI
The authors created CUDA Agent, a system that trains an AI to become a hardware optimization expert using Reinforcement Learning (RL).
Think of RL like training a dog, but instead of treats, the dog gets points for running faster.
- The Dog: The AI model.
- The Trick: Writing a super-fast CUDA kernel.
- The Treat: A "reward" signal only given if the code is correct and significantly faster than the baseline.
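The reward rule above can be sketched as a tiny function. This is a minimal illustration, not the paper's actual formula: the 1.2x threshold and the decision to pay out the raw speedup are assumptions for the sake of the example.

```python
def kernel_reward(correct: bool, speedup: float, threshold: float = 1.2) -> float:
    """Illustrative reward: points only for correct code that clearly
    beats the baseline. speedup = baseline_time / candidate_time,
    so 2.0 means twice as fast. The threshold is a made-up value."""
    if not correct:
        return 0.0   # broken code earns nothing, no matter how "fast"
    if speedup <= threshold:
        return 0.0   # correct but not meaningfully faster than baseline
    return speedup   # bigger speedups earn bigger rewards
```

The key design point is the hard zero for incorrect code: the AI can never trade correctness for speed, which closes off the "just print 'Fast!' and stop" cheat described later.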
3. How They Trained the Apprentice (The Three Secrets)
To make this work, they didn't just tell the AI to "try harder." They built a specialized training camp with three unique features:
A. The Infinite Practice Field (Data Synthesis)
You can't train a master chef if you only have 10 recipes. The authors realized there weren't enough "hard problems" for the AI to practice on.
- The Analogy: Imagine a gym where the machines automatically adjust their weight. If the AI solves a problem easily, the machine instantly makes it harder. If it's too hard, it makes it easier.
- What they did: They built a pipeline that automatically combines simple math operations into complex, new challenges. This created a "curriculum" of 6,000 unique problems, ranging from easy to "impossible," ensuring the AI learned to handle everything.
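The composition idea can be sketched as follows. This is a toy version under stated assumptions: the primitive operations, the sampling scheme, and the "longer chain = harder problem" rule are illustrative stand-ins for the paper's much richer pipeline of 6,000 tasks.

```python
import random

# Illustrative pool of simple math operations that get composed
# into new, harder challenges (stand-ins, not the paper's actual set).
PRIMITIVES = ["matmul", "relu", "softmax", "layernorm", "conv2d"]

def synthesize_problems(max_depth: int, per_level: int = 3, seed: int = 0):
    """Compose primitives into chains; deeper chains mean harder problems,
    giving a curriculum that ranges from easy to very hard."""
    rng = random.Random(seed)
    problems = []
    for depth in range(1, max_depth + 1):
        for _ in range(per_level):
            # one problem = a random chain of `depth` primitive ops
            problems.append(tuple(rng.choice(PRIMITIVES) for _ in range(depth)))
    return problems
```

Because difficulty is just chain depth here, the "gym machine" analogy falls out naturally: serve the model problems one depth above whatever it currently solves reliably.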
B. The Safe Sandbox (The Agent Environment)
In the past, AI models would try to "cheat" to get a reward. For example, if the goal was to run a program fast, the AI might write code that says, "Just print 'Fast!' and stop," which is technically fast but useless.
- The Analogy: Imagine a video game where the AI tries to glitch through the walls to win. The authors built a secure sandbox (a virtual prison) where the AI cannot touch the scoring system.
- What they did: They created a strict environment where the AI has to actually run the code on real GPUs to prove it works. They also gave the AI a "Skill Book" (a set of instructions) that teaches it the proper workflow: Analyze -> Code -> Test -> Profile -> Fix. This forces the AI to act like a real engineer, not a guesser.
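The Analyze -> Code -> Test -> Profile -> Fix workflow can be sketched as a loop. Everything here is hypothetical scaffolding: `llm` and `sandbox` (and all their methods) are placeholder names for the model and the isolated GPU environment, not the paper's actual API.

```python
def agent_episode(task, llm, sandbox, max_turns: int = 5):
    """Hypothetical sketch of the engineer-style workflow: the agent must
    actually run its kernel in the sandbox to prove correctness and speed."""
    plan = llm.analyze(task)               # Analyze the problem first
    kernel = llm.write_kernel(task, plan)  # Code an initial attempt
    for _ in range(max_turns):
        result = sandbox.run(kernel)       # Test on a real GPU
        if not result.correct:
            kernel = llm.fix(kernel, result.error)  # Fix correctness first
            continue
        profile = sandbox.profile(kernel)  # Profile the working kernel
        if profile.speedup >= task.target:
            return kernel                  # fast enough: episode done
        kernel = llm.optimize(kernel, profile)      # Fix performance next
    return kernel
```

The loop structure is what makes the agent "a real engineer, not a guesser": correctness is repaired before performance is tuned, and every claim is verified by actually executing code in the sandbox.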
C. The Stable Coach (RL Algorithm)
When they first tried to train the AI, it would learn for a few days and then suddenly "forget" everything and start writing gibberish. This is called "training collapse."
- The Analogy: Imagine teaching a child to ride a bike. If you just throw them on a bike and say "Go!", they will fall and get scared. You need to start with training wheels, then a balance bike, then a real bike.
- What they did: They used a "Warm-up" strategy. First, they taught the AI to write simple code (Single-Turn). Then, they filtered out the bad attempts (Rejection Fine-Tuning) so the AI only learned from its best moments. Finally, they trained the "Coach" (the Critic model) to know exactly how good a solution is before the main training started. This kept the AI stable and confident.
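The Rejection Fine-Tuning step above can be sketched as a simple filter. The data layout and the reward bar are illustrative assumptions; the point is only the mechanism: keep the high-reward attempts, discard the rest, and fine-tune on what survives.

```python
def rejection_finetune_data(attempts, reward_fn, min_reward: float = 1.0):
    """Illustrative rejection fine-tuning: keep only attempts whose reward
    clears a bar, so the model learns from its best moments only.

    attempts: iterable of (prompt, completion, outcome) triples.
    reward_fn: scores an outcome; anything below min_reward is rejected.
    """
    kept = []
    for prompt, completion, outcome in attempts:
        if reward_fn(outcome) >= min_reward:
            kept.append((prompt, completion))  # survives into training data
    return kept
```

This is why the warm-up stays stable: the model never imitates its own failures, so early bad habits are filtered out before the main RL training begins.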
4. The Results: Beating the Best
When they put CUDA Agent to the test against the "Magic Box" (torch.compile) and the world's smartest AI models (like Claude and Gemini):
- The Magic Box: Good and consistent, but rigid.
- Other AIs: Often broke the code or were slower than the Magic Box.
- CUDA Agent: It didn't just beat the Magic Box; it crushed it.
  - On easy tasks, it was 100% faster (twice as fast).
  - On the hardest, most complex tasks, it was still 92% faster.
  - It outperformed the best proprietary models by about 40%.
The Big Picture
This paper shows that we are moving past the era where AI just "writes text." We are entering an era where AI can optimize physical systems.
By giving an AI a safe place to fail, a massive library of practice problems, and a strict set of rules to follow, we can turn a general chatbot into a specialized engineer that understands the nitty-gritty details of computer hardware better than the compilers we've used for decades.
In short: CUDA Agent is the first AI that doesn't just know how to write code, but knows how to make it fly.