Imagine you are a food critic reviewing a new restaurant. The chef hands you a scorecard that says, "This meal is a 9.5 out of 10." But the chef refuses to show you the actual food, the recipe, or the notes on how they decided that score. They just say, "Trust me, it's a 9.5."

Now, imagine another critic tastes the exact same meal but gives it a 6.0. Without seeing the food or the recipe, you have no way of knowing who is right. Did the first critic use a different scale? Did they ignore the burnt toast? Did they count the dessert as part of the main course?

This is exactly the problem the paper "Rollout Cards: A Reproducibility Standard for Agent Research" aims to solve in the world of AI "agents" (programs that carry out tasks such as writing code, browsing the web, or solving math problems).

Here is a simple breakdown of what the paper says, using everyday analogies:

The Problem: The "Black Box" Score

Currently, when researchers publish results about AI agents, they usually only share the final score (the "9.5"). They throw away the rollout record.

The Rollout Record: Think of this as the full video recording of the AI doing the task. It includes every step it took, every tool it clicked, every mistake it made, how long it took, and whether it crashed or got stuck.
The Issue: Different research teams use different "rules" to turn that video into a score.
- Team A might say, "If the AI crashes, we ignore that attempt."
- Team B might say, "If the AI crashes, that counts as a zero."
- Team C might say, "We only count the final answer, ignoring the 50 steps it took to get there."
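To make the Team A versus Team B difference concrete, here is a minimal sketch in Python. The Rollout class, the status labels, and the ten-attempt batch are all made up for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    status: str   # "completed" or "crashed" (hypothetical status labels)
    solved: bool  # whether the final answer was judged correct


def score_ignore_crashes(rollouts):
    """Team A style: crashed attempts are dropped from the denominator."""
    finished = [r for r in rollouts if r.status != "crashed"]
    return sum(r.solved for r in finished) / len(finished)


def score_crash_as_zero(rollouts):
    """Team B style: every attempt counts, and a crash scores zero."""
    solved = sum(r.solved for r in rollouts if r.status != "crashed")
    return solved / len(rollouts)


# Made-up batch: 6 solved, 2 completed but wrong, 2 crashed.
batch = (
    [Rollout("completed", True)] * 6
    + [Rollout("completed", False)] * 2
    + [Rollout("crashed", False)] * 2
)

print(score_ignore_crashes(batch))  # 0.75
print(score_crash_as_zero(batch))   # 0.6
```

The AI behaved identically in both cases. Only the denominator changed, and the score moved by 15 points.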

The paper found that none of the 50 popular AI research repositories they checked reported how many attempts failed or crashed alongside their main score. It's like a sports team announcing, "We won 3 games!" without mentioning the 10 it lost.

The Evidence: Rules Change the Game

The authors audited 50 AI research repositories and catalogued 37 specific cases where changing the "rulebook" changed the score, even though the AI did the exact same thing.

The "MMLU" Example: The same AI model (LLaMA-65B) scored 63.7 under one set of rules and 48.8 under another. That 14.9-point gap comes entirely from how the score was calculated, not from any change in the AI.
The "SWE-bench" Example: In software engineering tasks, whether you count "failed attempts" as part of the total or throw them away changed the success rate by 15.6 percentage points.
The "MLE-Bench" Example: Depending on whether you define a "pass" as getting a gold medal or just a passing grade, the success rate of the same AI submissions dropped from 34.2% to 13.3%.

The paper argues that without the video recording (the rollout), we can't tell if the AI is actually better, or if the researcher just used a more lenient rulebook.

The Solution: The "Rollout Card"

To fix this, the authors propose a new standard called a Rollout Card.

Think of a Rollout Card like a transparent, tamper-proof recipe box that you must include with your final dish. It contains:

The Full Video: The complete record of the AI's actions, errors, and timing.
The Rulebook: A clear declaration of exactly how the score was calculated (e.g., "We ignored crashes," or "We counted every token").
The "Missing Pieces" List: An honest note saying, "We couldn't share the full video because of privacy, so here is exactly what we cut out."

This allows other scientists to look at the same video and ask different questions. Maybe the original paper only cared about "Did it finish the task?" but a new researcher wants to ask, "Did it use too much money?" or "Did it make dangerous tool calls?" With the Rollout Card, they can answer those questions without having to run the expensive experiment all over again.
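As a rough illustration of asking a new question of an old recording, the sketch below scans preserved records for episodes whose text looked safe but whose tool calls did not. The record layout, the tool names, and the FORBIDDEN policy are assumptions invented for this example, not the format used by the paper.

```python
# Hypothetical preserved records: each keeps the text verdict AND the tool calls.
records = [
    {"id": "ep1", "text_safe": True,  "tool_calls": ["search", "read_file"]},
    {"id": "ep2", "text_safe": True,  "tool_calls": ["delete_records"]},
    {"id": "ep3", "text_safe": False, "tool_calls": []},
    {"id": "ep4", "text_safe": True,  "tool_calls": []},
    {"id": "ep5", "text_safe": True,  "tool_calls": ["search", "send_payment"]},
]

FORBIDDEN = {"delete_records", "send_payment"}  # assumed policy, for illustration


def text_safe_but_forbidden(records, forbidden):
    """Return ids of episodes whose text looked safe but whose tool calls did not."""
    return [
        r["id"]
        for r in records
        if r["text_safe"] and forbidden.intersection(r["tool_calls"])
    ]


flagged = text_safe_but_forbidden(records, FORBIDDEN)
text_safe_total = sum(r["text_safe"] for r in records)

print(flagged)                         # ['ep2', 'ep5']
print(len(flagged) / text_safe_total)  # 0.5
```

If only the text verdicts had been kept, this question could not be answered without running the agents again.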

What They Actually Did (The Experiments)

The authors didn't just talk about this; they tested it with real data:

Re-discovering Hidden Insights: They took four existing public datasets (from tools like GAP, MAESTRO, COPRA, and Tree-of-Thought) that had been published before. By applying the Rollout Card method, they found new facts that the original papers missed.
- Example: They found that about 20% of AI responses that looked "safe" in text actually made forbidden tool calls in the background. The original score missed this because it only looked at the text.
- Example: They found that in multi-agent teams, "failures" actually involved much more coordination work than "successes," suggesting that extra work doesn't always mean better answers.
Re-grading the Same Work: They took public AI submissions (like code patches or math answers) and re-scored them using different rulebooks.
- Result: Changing only the scoring rule changed the reported scores by up to 20.9 percentage points. In some cases, it flipped the ranking, making a "worse" AI look like the "winner" just because the rulebook changed.

The Bottom Line

The paper concludes that publishing just a score is like publishing a final exam grade without the test paper. It hides the details that matter.

By introducing Rollout Cards, the authors want to make AI research reproducible. They have already released a free, open-source tool (called ERGON) and 21 public datasets (Rollout Cards) covering tasks like software engineering, web browsing, and math. This allows anyone to inspect the "video recording" behind the scores, ensuring that when we say an AI is smart, we actually know why and how we measured it.

What the paper does NOT claim:

It does not claim this will make AI safer or more powerful on its own.
It does not claim this solves all privacy issues (you still have to decide what to hide).
It does not claim this is a new way to train AI; it is a new way to report and audit the results of AI agent experiments.

Technical Summary: Rollout Cards: A Reproducibility Standard for Agent Research

Problem Statement

The paper identifies a critical reproducibility crisis emerging in agent research, mirroring historical issues in machine learning and reinforcement learning. Current practices prioritize publishing reported scores (e.g., accuracy, pass rates) while discarding the underlying rollout records (the full trace of agent-environment interactions) and the specific reporting rules used to compute those scores.

This fragmentation creates two primary failure modes:

Recording Failure: Rollout batches are scored once and discarded. Without the raw records, later researchers cannot re-analyze the same episodes to study behaviors the original report omitted (e.g., safety violations in tool calls, coordination overhead in multi-agent systems) or apply new views to the data. Re-running these experiments is often prohibitively expensive due to the rising costs of frontier model inference and the rapid obsolescence of evaluation scaffolds.
Reporting Failure: Reporting rules (the procedures converting views of rollouts into scores) vary across frameworks and are rarely disclosed. This leads to significant score discrepancies for identical underlying behaviors. The authors' audit of 50 popular repositories found that none report failed, errored, or skipped rollouts alongside headline scores. Furthermore, they documented 37 cases where differing reporting rules (e.g., token accounting, failure handling, prompt templates) resulted in dramatic score variations, sometimes flipping model rankings or shifting success rates by over 20 percentage points.

Methodology

The authors propose a shift in the unit of reproducibility from the "reported score" to the rollout record, paired with explicit declarations of how that record is processed.

The Rollout Card

The core contribution is the Rollout Card, a publication bundle designed as a minimum-sufficient specification. It consists of:

Rollout Record: A self-describing archive containing the episode evidence: task specification, environment state, agent actions (messages, tool calls), artifacts, timing, and terminal status. Crucially, it treats failures as status changes within the record rather than exceptions that bypass logging.
Reporting Rule Registry: A declaration of every view and reporting rule applied to the record to generate a reported score, including implementation details and versions.
Drops Manifest: A typed record specifying which fields, rows, or streams were read, filtered, or collapsed by a specific analysis. This explicitly documents what information was omitted, allowing future researchers to understand the limitations of a reported view.
Release-Scope Metadata: Declarations regarding redaction, licensing, and access limits.
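The four components above can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, and example values (including the license string and the "resolved_rate" rule) are hypothetical and do not reproduce the paper's schema or ERGON's format.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class RolloutRecord:
    task: str
    actions: list          # messages and tool calls, in order
    artifacts: list
    duration_s: float
    terminal_status: str   # e.g. "success", "failed", "errored", "skipped"


@dataclass
class ReportingRule:
    name: str
    version: str
    description: str


@dataclass
class DropsManifest:
    analysis: str            # which analysis this manifest describes
    fields_read: list        # fields the analysis actually used
    rows_filtered: str       # description of any rows excluded
    streams_collapsed: list  # streams summarized or ignored


@dataclass
class RolloutCard:
    records: list
    rules: list
    drops: list
    release_scope: dict = field(default_factory=dict)


card = RolloutCard(
    records=[RolloutRecord("fix failing test", [{"tool": "run_tests"}],
                           ["patch.diff"], 12.5, "errored")],
    rules=[ReportingRule("resolved_rate", "0.1",
                         "errored rollouts count as unresolved")],
    drops=[DropsManifest("resolved_rate", ["terminal_status"], "none", ["actions"])],
    release_scope={"redaction": "none", "license": "CC-BY-4.0"},
)

# Serialize the whole bundle so record, rule, and omissions travel together.
print(json.dumps(asdict(card), indent=2))
```

Keeping the drops manifest next to the rule makes it visible that this particular score never read the action stream, even though the record preserves it.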

The authors implemented a reference specification in ERGON, an open-source reinforcement learning gym, which acts as a lightweight dataset adapter to validate, map, and export these bundles.
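The idea that a failure is a status change inside the record, rather than an exception that bypasses logging, can be sketched as follows. This is an illustration under assumed names (run_episode, flaky_agent, the status strings); it is not ERGON's API.

```python
import time


def run_episode(task_id, agent_fn):
    """Run one episode and always return a record, even if the agent raises."""
    record = {"task_id": task_id, "status": "running", "error": None, "output": None}
    start = time.monotonic()
    try:
        record["output"] = agent_fn(task_id)
        record["status"] = "completed"
    except Exception as exc:  # the failure becomes data instead of vanishing
        record["status"] = "errored"
        record["error"] = f"{type(exc).__name__}: {exc}"
    finally:
        record["duration_s"] = time.monotonic() - start
    return record


def flaky_agent(task_id):
    """Stand-in agent that fails on one task (hypothetical)."""
    if task_id == "t2":
        raise TimeoutError("tool call timed out")
    return f"answer for {task_id}"


log = [run_episode(t, flaky_agent) for t in ["t1", "t2", "t3"]]
print([r["status"] for r in log])  # ['completed', 'errored', 'completed']
```

Because the errored episode stays in the log, a later reporting rule can decide explicitly whether to count it, and a reader can see that decision.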

Empirical Evaluation

The paper validates the utility of Rollout Cards through two retrospective experiments using public artifacts:

RQ1 (Reusability of Preserved Records): The authors analyzed four public releases (GAP, MAESTRO, COPRA miniF2F logs, and Tree-of-Thought) that preserved sufficient rollout evidence. They computed secondary analyses that the original papers did not report:
- GAP: Found that 20.6% of responses certified as "text-safe" actually contained forbidden tool calls, a failure invisible to text-only safety scores.
- MAESTRO: Revealed that failed multi-agent runs incurred 5x more coordination spans and 7x more tokens than successful runs, contradicting the assumption that extra collaboration always improves outcomes.
- COPRA: Showed that extended proof-search steps correlated negatively with success, suggesting repeated steps often indicate failed recovery rather than useful reasoning.
- Tree-of-Thought: Demonstrated that pruning strategies could preserve final rewards while significantly reducing wasted exploration, a nuance hidden by final reward metrics alone.
RQ2 (Impact of Reporting Rules): The authors held benchmark artifacts fixed (e.g., GPT-4o submissions to SWE-bench, Kaggle submissions for MLE-Bench) and applied alternative reporting rules.
- Changing the definition of "success" or handling of missing patches in SWE-bench altered the reported capability gap between agents by 2.3 percentage points.
- Changing the grader on $\tau$-bench reversed the ranking of frontier models (GPT-4o vs. Claude 3.5 Sonnet), a swing of 16.9 percentage points.
- Changing the medal/pass definition for MLE-Bench dropped the pass rate from 34.2% to 13.3% (a 20.9 point gap).
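The MLE-Bench gap reported above is simple arithmetic: 34.2 - 13.3 = 20.9 percentage points. The sketch below shows how a pass definition alone can move a rate when the submissions are held fixed. The medal labels, the ten submissions, and the pass_rate helper are invented for illustration; they are not the benchmark's data or the paper's rules.

```python
# Hypothetical graded submissions; the medal labels are illustrative only.
submissions = ["gold", "silver", "bronze", "none", "none",
               "gold", "bronze", "none", "silver", "none"]


def pass_rate(medals, passing):
    """Fraction of submissions whose medal is in the set counted as a pass."""
    return sum(m in passing for m in medals) / len(medals)


any_medal = pass_rate(submissions, {"gold", "silver", "bronze"})
gold_only = pass_rate(submissions, {"gold"})

print(any_medal)                                # 0.6
print(gold_only)                                # 0.2
print(round((any_medal - gold_only) * 100, 1))  # 40.0

# The gap quoted in the text, recomputed from its two endpoints.
print(round(34.2 - 13.3, 1))                    # 20.9
```

Nothing about the submissions differs between the two calls; only the set of outcomes counted as a pass does.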

Key Contributions

Diagnosis of Publication Failures: A structured audit of 50 repositories and a catalogue of 37 reporting-rule discrepancies demonstrating that current practices hide failures and obscure the convention-driven nature of score gaps.
Rollout Card Specification: A formal publication standard that preserves the rollout record, declares the views and rules applied, and documents omissions via drops manifests.
Reference Implementation and Data Release: An open-source implementation in ERGON and the public release of 21 rollout-card exports (17 trace-publication exports and 4 analytic/recovered-view exports) covering tool use, software engineering, safety, and search.

Results

Scientific Reuse: Preserved rollout records enabled the discovery of safety failures, coordination overheads, and search inefficiencies that were not visible in the original reported scores.
Convention Sensitivity: The experiments confirmed that reporting rules are not neutral; changing them on fixed evidence can alter reported scores by up to 20.9 percentage points and invert model rankings.
Transparency: The Rollout Card structure makes the "black box" of evaluation transparent, allowing disagreements to be traced to specific reporting choices rather than ambiguous model behavior.

Significance and Claims

The paper claims that publishing scores alone extracts only a fraction of the value of agent experiments. By treating rollout records as the unit of reproducibility, the community can:

Mitigate the Recording Problem: Enable new scientific questions to be asked of existing, expensive data without re-running frontier agents.
Mitigate the Reporting Problem: Make convention-driven score changes inspectable, allowing researchers to distinguish between agent behavior and the rules used to record it.

The authors are modest about the scope, noting that Rollout Cards do not prevent selective metric choice and do not remove privacy constraints or the need for redaction. Instead, their role is to make the record, the rule, and the omissions inspectable, so that disagreements can be traced to preserved evidence, reporting choices, or actual agent behavior. The work aims to support future research, meta-analyses, and reporting-rule comparisons without requiring new, expensive frontier rollout budgets.
