CelloAI Benchmarks: Toward Repeatable Evaluation of AI Assistants

This paper introduces CelloAI, a suite of repeatable, automated benchmarks for evaluating Large Language Models on HEP/HPC-specific tasks (code documentation, GPU kernel generation, and graphical data analysis), built to reflect the unique constraints of scientific software development.

Original authors: Mohammad Atif, Kriti Chopra, Fang-Ying Tsai, Ozgur O. Kilic, Tianle Wang, Zhihua Dong, Douglas Benjamin, Charles Leggett, Meifeng Lin, Paolo Calafiura, Salman Habib

Published 2026-03-03

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a brilliant, but slightly naive, new intern (the AI) how to work in a massive, ancient, and incredibly complex factory. This factory doesn't build cars; it simulates the birth of the universe using supercomputers. The blueprints are thousands of pages long, written by different people over decades, and if the intern makes a tiny mistake, the whole simulation could crash or, worse, give you the wrong answer about how the universe works.

This paper is about CelloAI, a benchmark suite built by a team of researchers who decided that the standard "tests" we use to check AI coding skills aren't good enough for this high-stakes factory. They built a new set of specialized training drills (benchmarks) to see if AI can actually handle the messy, real-world job of scientific coding.

Here is a breakdown of their three main drills, explained simply:

1. The "Translator" Drill (Code Documentation)

The Problem: The factory's manuals are a mess. Some parts have no labels, and the labels that exist are vague. The AI needs to write clear, structured notes (like "Doxygen" comments) for every machine part so future workers know what it does.
The Drill: They asked different AIs to write these notes.

  • The Scorecard: They didn't just check if the AI wrote something. They checked:
    • Did it miss anything? (Coverage): Did it label every single screw and wire?
    • Did it make sense? (Semantics): Did it describe the machine correctly, or did it just sound like it was speaking gibberish?
  • The Result: Most AIs are great at remembering to put a label on every part (high coverage), but they often struggle to write a good description that matches what a human expert would say. It's like an intern who labels every tool but writes "This is a thing" instead of "This is a wrench for tightening bolts." (A rough sketch of what this kind of scoring could look like follows this list.)
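To make the scorecard idea concrete, here is a minimal, illustrative Python sketch of how a coverage score and a rough semantic score could be computed for generated Doxygen-style comments. The function names, the example data, and the simple text-similarity stand-in (difflib) are assumptions for illustration; the paper's actual metrics and tooling may differ, and a real semantic check would more likely use an embedding-based similarity against reference documentation.

```python
import difflib

def coverage_score(entities, documented):
    """Fraction of code entities (functions, classes, members) that received a comment."""
    return len(documented & entities) / len(entities) if entities else 0.0

def semantic_score(generated: str, reference: str) -> float:
    """Crude stand-in for semantic similarity between a generated comment and a
    human-written reference; a real benchmark would likely use embeddings instead."""
    return difflib.SequenceMatcher(None, generated.lower(), reference.lower()).ratio()

# Hypothetical example: the AI documented 2 of 3 entities in a file.
entities = {"TrackFitter::fit", "TrackFitter::reset", "TrackFitter::chi2"}
documented = {"TrackFitter::fit", "TrackFitter::reset"}
print(f"coverage  = {coverage_score(entities, documented):.2f}")   # 0.67

generated = "Fits a helical track to the supplied hits and returns the chi-squared."
reference = "Fit a helix to the hit collection; returns the fit chi2."
print(f"semantics = {semantic_score(generated, reference):.2f}")
```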

2. The "Moving Day" Drill (Code Porting)

The Problem: The factory is upgrading its machinery. They need to move a complex process from an old, slow machine (CPU) to a new, super-fast one (GPU). This is like trying to move a delicate, intricate clockwork mechanism from a wooden box to a high-tech steel case without breaking a single gear.
The Drill: They gave the AIs a specific, tricky job: take a piece of code that runs on a standard computer and rewrite it to run on a graphics card (GPU) used for high-speed physics simulations.

  • The Scorecard: They didn't just ask, "Did the code look right?" They ran the code. (A rough sketch of this compile-run-check pipeline follows this list.)
    • Did it compile? (Did it turn on?)
    • Did it run? (Did it work without crashing?)
    • Did it give the right answer? (Did it simulate the physics correctly?)
  • The Result: This was the hardest test.
    • Easy tasks: Simple jobs (like resetting a counter) were easy for the AIs.
    • Hard tasks: The complex simulation tasks were a disaster. Even the strongest AIs got the code to run correctly only once or twice out of ten attempts. It showed that while AI is good at writing small snippets, it struggles with the "big picture" logic needed to move complex systems without breaking them.
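The "run the code" scorecard can be pictured as a small harness that compiles the AI-generated port, executes it, and compares its numerical output to a trusted CPU reference. The sketch below is a simplified, hypothetical Python version: the file names, the choice of nvcc as the compiler, and the output format (whitespace-separated numbers) are all assumptions for illustration, not details taken from the paper.

```python
import subprocess

def evaluate_port(candidate_src="candidate_kernel.cu",
                  reference_output="cpu_reference.txt",
                  tolerance=1e-6):
    """Compile, run, and numerically check an AI-generated GPU port (illustrative only)."""
    # 1. Did it compile?
    build = subprocess.run(["nvcc", candidate_src, "-o", "candidate"],
                           capture_output=True, text=True)
    if build.returncode != 0:
        return {"compiled": False, "ran": False, "correct": False}

    # 2. Did it run without crashing?
    run = subprocess.run(["./candidate"], capture_output=True, text=True, timeout=300)
    if run.returncode != 0:
        return {"compiled": True, "ran": False, "correct": False}

    # 3. Did it give the right answer? Compare numbers against the CPU reference.
    got = [float(x) for x in run.stdout.split()]
    with open(reference_output) as f:
        want = [float(x) for x in f.read().split()]
    correct = (len(got) == len(want) and
               all(abs(g - w) <= tolerance for g, w in zip(got, want)))
    return {"compiled": True, "ran": True, "correct": correct}
```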

3. The "Detective" Drill (Graphical Data Analysis)

The Problem: Scientists look at thousands of charts and graphs to find problems. If a graph looks slightly different than yesterday's, it could mean a broken machine or a new discovery. Humans are good at spotting these "glitches." Can an AI look at a picture of a graph and say, "Hey, that line is weird"?
The Drill: They showed the AIs pictures of histograms (bar charts) with a "normal" line and a "monitored" line that had some errors.

  • The Scorecard: The AI had to point out exactly where the lines didn't match and list the weird points. (A rough sketch of how the "true" mismatched points might be found follows this list.)
  • The Result: The AIs were okay at spotting the obvious errors, but they weren't great at being precise detectives. Some models missed the clues entirely, while others found them but got confused about exactly where the problem started. It's like showing a detective a crime scene photo; they can see something is wrong, but they can't always pinpoint the exact fingerprint.
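One way to picture the ground truth for this drill: compare the monitored histogram to the reference bin by bin and flag any bin that deviates beyond a threshold, then check the AI's answer against that list of flagged bins. The sketch below uses NumPy with a simple relative-difference cut; the 10% threshold and the example bin contents are made up for illustration and are not taken from the paper.

```python
import numpy as np

def flag_deviant_bins(reference, monitored, rel_threshold=0.10):
    """Return indices of bins where the monitored histogram deviates from the
    reference by more than rel_threshold (relative difference); illustrative only."""
    reference = np.asarray(reference, dtype=float)
    monitored = np.asarray(monitored, dtype=float)
    denom = np.where(reference != 0, reference, 1.0)   # avoid division by zero
    rel_diff = np.abs(monitored - reference) / denom
    return np.flatnonzero(rel_diff > rel_threshold)

# Hypothetical example: bins 3 and 7 were deliberately distorted.
ref = [100, 120, 140, 150, 140, 120, 100, 80]
mon = [101, 119, 141, 190, 139, 121, 99, 40]
print(flag_deviant_bins(ref, mon))   # [3 7]
```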

The Big Takeaway

The main message of this paper is: We need better tests for AI in science.

Current tests are like asking an AI to solve a math puzzle on a clean sheet of paper. But real scientific coding is like fixing a leak in a dam while the water is rushing over it. You can't just write code; you have to make sure it fits into a giant, fragile system, compiles, runs, and doesn't break the laws of physics.

The researchers built these new "drills" so that in the future, we can fairly compare AI models and figure out which ones are actually ready to help scientists build the next generation of supercomputers, rather than just ones that are good at writing homework essays.
