Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.
Imagine you are a food critic reviewing a new restaurant. The chef hands you a scorecard that says, "This meal is a 9.5 out of 10." But the chef refuses to show you the actual food, the recipe, or the notes on how they decided that score. They just say, "Trust me, it's a 9.5."
Now, imagine another critic tastes the exact same meal but gives it a 6.0. Without seeing the food or the recipe, you have no way of knowing who is right. Did the first critic use a different scale? Did they ignore the burnt toast? Did they count the dessert as part of the main course?
This is exactly the problem Rollout Cards aims to solve in the world of AI "agents" (smart computer programs that do tasks like writing code, browsing the web, or solving math problems).
Here is a simple breakdown of what the paper says, using everyday analogies:
The Problem: The "Black Box" Score
Currently, when researchers publish results about AI agents, they usually only share the final score (the "9.5"). They throw away the rollout record.
- The Rollout Record: Think of this as the full video recording of the AI doing the task. It includes every step it took, every tool it clicked, every mistake it made, how long it took, and whether it crashed or got stuck.
- The Issue: Different research teams use different "rules" to turn that video into a score.
  - Team A might say, "If the AI crashes, we ignore that attempt."
  - Team B might say, "If the AI crashes, that counts as a zero."
  - Team C might say, "We only count the final answer, ignoring the 50 steps it took to get there."
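To make the difference between Team A's and Team B's rules concrete, here is a minimal Python sketch. The `Rollout` class, the function names, and the ten toy attempts are all hypothetical illustrations, not data or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    solved: bool   # did the final answer pass the task's check?
    crashed: bool  # did the run crash before finishing?


# Hypothetical toy data: 10 attempts by one agent (not from the paper).
rollouts = (
    [Rollout(solved=True, crashed=False)] * 6
    + [Rollout(solved=False, crashed=False)] * 2
    + [Rollout(solved=False, crashed=True)] * 2
)


def score_excluding_crashes(rs):
    """Team A's rule: crashed attempts are dropped from the denominator."""
    kept = [r for r in rs if not r.crashed]
    return sum(r.solved for r in kept) / len(kept)


def score_crashes_as_zero(rs):
    """Team B's rule: crashed attempts stay in the denominator as failures."""
    return sum(r.solved for r in rs) / len(rs)


score_a = score_excluding_crashes(rollouts)  # 6 solved out of 8 kept attempts
score_b = score_crashes_as_zero(rollouts)    # 6 solved out of all 10 attempts
print(f"Team A rule: {score_a:.2f}, Team B rule: {score_b:.2f}")
```

The agent's behavior is identical in both calls; only the denominator changes. Without the crash count reported next to the score, a reader cannot tell which rule produced it.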
The paper found that none of the 50 popular AI research repositories they checked reported how many attempts failed or crashed alongside their main score. It's like a sports team saying, "We won 3 games!" but hiding the fact that they lost 10 games and only counted the 3 they won.
The Evidence: Rules Change the Game
In their audit of 50 AI research repositories, the authors found 37 specific cases where changing the "rulebook" changed the score, even though the AI did the exact same thing.
- The "MMLU" Example: The same AI model (LLaMA-65B) got a score of 63.7 under one set of rules and 48.8 under another. That's a huge difference just because of how the score was calculated, not because the AI changed.
- The "SWE-bench" Example: In software engineering tasks, whether you count "failed attempts" as part of the total or throw them away changed the success rate by 15.6 percentage points.
- The "MLE-Bench" Example: When the definition of a "pass" was tightened from just a passing grade to a gold medal, the success rate of the same AI submissions dropped from 34.2% to 13.3%.
The paper argues that without the video recording (the rollout), we can't tell if the AI is actually better, or if the researcher just used a more lenient rulebook.
The Solution: The "Rollout Card"
To fix this, the authors propose a new standard called a Rollout Card.
Think of a Rollout Card like a transparent, tamper-proof recipe box that you must include with your final dish. It contains:
- The Full Video: The complete record of the AI's actions, errors, and timing.
- The Rulebook: A clear declaration of exactly how the score was calculated (e.g., "We ignored crashes," or "We counted every token").
- The "Missing Pieces" List: An honest note saying, "We couldn't share the full video because of privacy, so here is exactly what we cut out."
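The three parts listed above could be pictured as a small data structure. The sketch below is an assumption made for illustration only: the class names, field names, and helper functions are hypothetical and are not the schema used by the paper or by its released tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    action: str                  # e.g. a tool call the agent made
    error: Optional[str] = None  # error message, if this step failed
    seconds: float = 0.0         # wall-clock time spent on this step


@dataclass
class RolloutCard:
    rollouts: List[List[Step]]   # "the full video": every step of every attempt
    scoring_rule: str            # "the rulebook": how the score was computed
    omissions: List[str] = field(default_factory=list)  # "missing pieces" list


def total_seconds(card: RolloutCard) -> float:
    """A question the original score may not have asked: total time spent."""
    return sum(step.seconds for attempt in card.rollouts for step in attempt)


def attempts_with_errors(card: RolloutCard) -> int:
    """Count attempts in which at least one step recorded an error."""
    return sum(any(s.error for s in attempt) for attempt in card.rollouts)


# Hypothetical example card with two attempts.
card = RolloutCard(
    rollouts=[
        [Step("open_file", seconds=1.5),
         Step("run_tests", error="timeout", seconds=30.0)],
        [Step("submit_patch", seconds=2.0)],
    ],
    scoring_rule="crashed attempts count as failures",
    omissions=["tool outputs containing user data were redacted"],
)
print(total_seconds(card), attempts_with_errors(card))
```

Because the steps are kept rather than collapsed into one number, helpers such as `total_seconds` can be written long after the original experiment, without re-running it.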
This allows other scientists to look at the same video and ask different questions. Maybe the original paper only cared about "Did it finish the task?" but a new researcher wants to ask, "Did it use too much money?" or "Did it make dangerous tool calls?" With the Rollout Card, they can answer those questions without having to run the expensive experiment all over again.
What They Actually Did (The Experiments)
The authors didn't just talk about this; they tested it with real data:
Re-discovering Hidden Insights: They took four existing public datasets (from GAP, MAESTRO, COPRA, and Tree-of-Thought). By applying the Rollout Card method, they found new facts that the original papers missed.
- Example: They found that 20% of AI responses that looked "safe" in text actually made forbidden tool calls in the background. The original score missed this because it only looked at the text.
- Example: They found that in multi-agent teams, "failures" actually involved much more coordination work than "successes," suggesting that extra work doesn't always mean better answers.
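The first finding above is the kind of check that only works when tool calls are kept in the record. Here is a minimal sketch of such an audit; the record layout, the tool names, and the `FORBIDDEN_TOOLS` policy are hypothetical, and the toy data is not from the paper.

```python
# Hypothetical policy: tools the agent is not allowed to call.
FORBIDDEN_TOOLS = {"delete_file", "send_email"}

# Hypothetical toy records: a text-based safety label plus the tool calls made.
responses = [
    {"text_label": "safe", "tool_calls": ["read_file"]},
    {"text_label": "safe", "tool_calls": ["read_file", "delete_file"]},
    {"text_label": "unsafe", "tool_calls": []},
    {"text_label": "safe", "tool_calls": []},
    {"text_label": "safe", "tool_calls": ["send_email"]},
]


def hidden_violation_rate(records, forbidden):
    """Share of text-'safe' responses that still made a forbidden tool call."""
    safe = [r for r in records if r["text_label"] == "safe"]
    if not safe:
        return 0.0
    flagged = [r for r in safe if forbidden.intersection(r["tool_calls"])]
    return len(flagged) / len(safe)


rate = hidden_violation_rate(responses, FORBIDDEN_TOOLS)
print(f"{rate:.0%} of safe-looking responses made a forbidden tool call")
```

A score computed from `text_label` alone would report four safe responses out of five here. The tool-call column shows that two of those four broke the policy.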
Re-grading the Same Work: They took public AI submissions (like code patches or math answers) and re-scored them using different rulebooks.
- Result: Changing only the scoring rule changed the reported scores by up to 20.9 percentage points. In some cases, it flipped the ranking, making a "worse" AI look like the "winner" just because the rulebook changed.
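A ranking flip of this kind is easy to reproduce on paper. The sketch below uses two hypothetical agents with made-up counts (not results from the paper) and re-grades them under two crash-handling rules.

```python
# Hypothetical per-agent counts; not taken from the paper.
counts = {
    "agent_x": {"solved": 4, "crashed": 5, "total": 10},
    "agent_y": {"solved": 6, "crashed": 0, "total": 10},
}


def rank(counts, drop_crashes):
    """Return agent names ordered best-first under the chosen rule."""
    scores = {}
    for name, c in counts.items():
        # Lenient rule: crashed attempts leave the denominator.
        # Strict rule: crashed attempts stay in and count as failures.
        denom = c["total"] - c["crashed"] if drop_crashes else c["total"]
        scores[name] = c["solved"] / denom
    return sorted(scores, key=scores.get, reverse=True)


lenient = rank(counts, drop_crashes=True)   # agent_x scores 4/5, agent_y 6/10
strict = rank(counts, drop_crashes=False)   # agent_x scores 4/10, agent_y 6/10
print("lenient:", lenient)
print("strict: ", strict)
```

Under the lenient rule `agent_x` comes first; under the strict rule `agent_y` does. Neither agent's behavior changed between the two calls.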
The Bottom Line
The paper concludes that publishing just a score is like publishing a final exam grade without the test paper. It hides the details that matter.
By introducing Rollout Cards, the authors want to make AI research reproducible. They have already released a free, open-source tool (called ERGON) and 21 public datasets (Rollout Cards) covering tasks like software engineering, web browsing, and math. This allows anyone to inspect the "video recording" behind the scores, ensuring that when we say an AI is smart, we actually know why and how we measured it.
What the paper does NOT claim:
- It does not claim this will make AI safer or more powerful on its own.
- It does not claim this solves all privacy issues (you still have to decide what to hide).
- It does not claim this is a new way to train AI; it is a new way to report and audit the results of AI training.