CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

This paper introduces CUDABench, a comprehensive benchmark designed to evaluate Large Language Models' text-to-CUDA generation capabilities across diverse domains by assessing compilation correctness, functional consistency, and performance via a novel roofline-based metric.

Jiace Zhu, Wentao Chen, Qi Fan, Zhixing Ren, Junying Wu, Xing Zhe Chai, Chotiwit Rungrueangwutthinon, Yehan Ma, An Zou

Published 2026-03-04

Imagine you have a brilliant, hyper-intelligent robot assistant (an LLM) that can write code. You ask it to write a recipe for a complex dish, and it does a great job. But now, you ask it to write a recipe specifically for a high-performance racing engine that needs to run on a specific, incredibly fast, and temperamental machine (a GPU).

This is the challenge the paper CUDABench tackles.

Here is the breakdown of the paper using simple analogies:

1. The Problem: The "Chef" vs. The "Race Car Mechanic"

Previously, researchers tested these AI "chefs" by asking them to translate a recipe written in one language (like Python) into another (CUDA, the language for GPUs). It's like asking a chef to translate a French menu into English. They just have to swap the words; the cooking instructions are already there.

But in the real world, you often just say, "Make me a fast engine for this specific car," without giving them the blueprint. This is Text-to-CUDA. The AI has to invent the engine from scratch based only on a text description.

The paper argues that previous tests were too easy because they didn't test if the AI could actually design the engine, only if it could translate the instructions. Also, just because an engine starts doesn't mean it's fast. A slow engine is useless for a race car.

2. The Solution: CUDABench (The Ultimate Driving Test)

The authors built a new, massive testing ground called CUDABench. Think of it as a driving school with three specific types of tests:

  • Breadth (The Variety of Roads): They didn't just test on a straight highway. They tested the AI on six different "terrains":
    • Linear Algebra (Doing math puzzles).
    • Deep Learning (Teaching the car to recognize cats).
    • Computer Vision (Seeing the road).
    • Data Analytics (Counting millions of cars in traffic).
    • Signal Processing (Listening to radio waves).
    • Science & Finance (Predicting weather or stock prices).
  • Depth (The Traffic Density): They tested the AI with tiny amounts of data (a single car) and massive amounts (a traffic jam of millions). The AI needs to handle both a quiet street and a gridlock.
  • Difficulty (The Instructions):
    • Level 1 (The Guided Tour): "Build an engine. Here is the blueprint, the tools, and the exact steps."
    • Level 2 (The Schematic): "Build an engine. Here is the blueprint, but you figure out the tools and steps."
    • Level 3 (The Whisper): "Build an engine. That's it. Good luck." (The AI has to remember everything from its training).

3. The Scorecard: The "Roofline" Metric

How do you know if the AI's engine is good?

  • Old Way: "How fast did it finish the lap?" (Execution Time).
    • Problem: If you test on a Ferrari vs. a Toyota, the Ferrari wins even if the driver is bad. It's unfair.
  • New Way (CUDABench-Score): They use a Roofline Model. Imagine a ceiling (the roof) representing the absolute maximum speed that specific car could possibly go. (In GPU terms, the ceiling is the hardware's peak compute throughput or its memory bandwidth, whichever binds first for that workload.)
    • If the AI's engine hits 90% of the ceiling, it's a genius.
    • If it only hits 10% of the ceiling, it's a disaster, even if it technically "works."
    • This score is hardware-independent. It tells you how well the AI utilized the car's potential, regardless of whether the car is a Ferrari or a Toyota.
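The roofline idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual scoring code: the function names, and the hardware numbers in the example, are assumptions made up for demonstration. The core formula, though, is the standard roofline model: attainable performance is the minimum of peak compute and memory bandwidth times arithmetic intensity (operations per byte moved).

```python
def roofline_ceiling(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Attainable FLOP/s for a kernel on given hardware.

    A kernel doing few operations per byte of memory traffic is capped
    by bandwidth; one doing many is capped by peak compute.
    """
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


def roofline_score(achieved_flops, peak_flops, peak_bandwidth,
                   arithmetic_intensity):
    """Fraction of the hardware's roofline the kernel reaches (0 to 1)."""
    ceiling = roofline_ceiling(peak_flops, peak_bandwidth,
                               arithmetic_intensity)
    return achieved_flops / ceiling


# Illustrative numbers only (roughly A100-class, not from the paper):
peak = 19.5e12   # ~19.5 TFLOP/s FP32 peak compute
bw = 1.5e12      # ~1.5 TB/s memory bandwidth

# A memory-bound kernel (low arithmetic intensity, e.g. a vector add)
# is judged against the bandwidth ceiling, not raw peak compute:
score = roofline_score(achieved_flops=100e9, peak_flops=peak,
                       peak_bandwidth=bw, arithmetic_intensity=0.08)
print(f"Reached {score:.0%} of the attainable ceiling")
```

This is what makes the score hardware-independent in the paper's sense: running the same kernel on a faster GPU raises the achieved speed, but it raises the ceiling too, so the ratio keeps measuring how well the code uses whatever machine it is on.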

4. The Results: The AI is a "Good Talker, Bad Builder"

The authors tested the smartest AI models available (like GPT-5, Claude, etc.) and found some surprising things:

  • The "Syntax" Trap: The AIs are amazing at grammar. They can write code that looks perfect and compiles (passes the grammar check) 99% of the time. It's like a car that looks perfectly assembled, but whose engine doesn't actually run.
  • The Logic Gap: Once the code starts running, the AIs fail often. They get the words right but the logic wrong. They forget to synchronize the workers (threads) or manage the memory correctly.
  • The Knowledge Hole: When asked to do complex, niche tasks (like financial modeling or specific scientific simulations) without hints (Level 3), the AIs crumble. They know general math, but they don't know the specific "tricks of the trade" for high-performance GPU programming.
  • The Speed Issue: Even when the code works, it's slow. The AIs are leaving about 60% of the GPU's power on the table. They aren't using the car's full potential.

The Bottom Line

CUDABench is a wake-up call. It shows that while AI is great at writing code that looks right, it still struggles to write code that is fast, efficient, and truly optimized for the complex hardware of the future.

It's like having a robot that can write a beautiful poem about a race car, but when you ask it to build the race car, it builds something that sputters and barely moves. We need to teach these robots not just to speak the language of engineers, but to think like them.
