AgentRivet: an automated system for producing Rivet… — Plain-Language Explanation

Original authors: Antonio J. Costa, Caterina Doglioni, Christian Gütschow, Andrew D. Pilkington, Sukanya Sinha

Published 2026-06-12


Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine the world of particle physics as a massive, high-stakes cooking competition. Scientists at giant machines (like the Large Hadron Collider) cook up complex "dishes" (collisions of particles) and write detailed recipes in scientific papers. They also provide a list of ingredients (data) so other chefs can try to recreate the dish.

However, there's a problem: To truly taste and compare these dishes, other scientists need a specific, standardized kitchen tool called Rivet. Think of Rivet as a specialized, high-tech measuring cup that ensures everyone is measuring the soup the exact same way. Without it, you can't fairly compare your soup to someone else's.

The trouble is, only about 40% of the published recipes come with this special measuring cup. The rest are just written descriptions, which are hard to turn into the precise code needed for the tool.

Enter AgentRivet: The AI Sous-Chef

The authors of this paper built a new system called AgentRivet. Think of it as a team of AI robots designed to read those messy, text-only recipes and automatically build the missing Rivet measuring cups (computer code) for you.

Here is how their "kitchen team" works, using a simple workflow:

The Analyst (The Reader): This AI robot reads the scientific paper and acts like a very careful sous-chef. It doesn't just read; it extracts the exact instructions: "Use 2 lemons," "Chop the onions this way," "Cook for 10 minutes." It turns the messy text into a clean, structured shopping list.
The Coder (The Builder): This robot takes the shopping list and tries to build the actual Rivet tool (which is written in a specific computer language called C++). It's like a robot arm trying to assemble a complex machine based on the instructions.
The Reviewers (The Inspectors): Before the tool is finished, two inspectors check the work.
- The Code Reviewer checks for technical errors, like using the wrong type of screw or a broken part (syntax errors).
- The Physics Reviewer checks if the instructions actually match the recipe. Did the robot measure the onions correctly? Did it follow the cooking time?

The "Taste Test" (The Results)

The authors tested this AI team on two very recent and complex recipes from ATLAS and CMS (two major experiments at the Large Hadron Collider). They asked the AI to build the Rivet tools from scratch.

The Good News: The AI team was surprisingly good at the job. They built working tools with very few technical glitches. When they used the tools to measure simulated particle collisions, the results looked very similar to what the human scientists expected.
The Bad News (The "Hallucinations"): Sometimes, the AI got confused by vague parts of the recipe.
- If the paper said, "Do something special with the sauce," but didn't explain exactly how, the AI would guess. Sometimes it guessed right; sometimes it guessed wrong.
- One AI model (Gemini) sometimes forgot to follow specific instructions about "neutrinos" (a type of invisible particle), while another (Claude) sometimes got stuck in a loop or wrote down its own "thoughts" instead of just the code.
- The AI struggled the most with the most complex, abstract parts of the recipes, like measuring the "shape" of the event or using complex math formulas that weren't clearly defined.

The Verdict

The paper concludes that AgentRivet is a promising new tool. It offers a way to start filling in the roughly 60% of published recipes that still lack a Rivet measuring cup, which would be a huge help to the physics community.

However, it's not perfect yet. It still needs a human to look over its shoulder, especially when the original recipe is vague. The authors suggest that in the future, they will teach the AI better by training it on more examples and adding automatic checks to catch errors before a human even sees them.

In short: AgentRivet is an automated team that reads science papers and builds the missing software tools scientists need to compare their data. It works well, but it still makes mistakes when the instructions are unclear, so human experts are still needed to double-check the work.

Technical Summary of AgentRivet: An Automated System for Producing Rivet Routines from Journal Publications

Problem Statement
Particle physics collider experiments rely on Rivet (Robust Independent Validation of Experiment and Theory), a C++ toolkit, to preserve analysis definitions and enable model-independent comparisons between theoretical predictions and experimental data. Despite the clear benefits of this preservation strategy, analysis coverage is critically incomplete. Currently, only 39% of measurements have documented and publicly available Rivet routines, with coverage ranging from 49% at ATLAS to 16% at ALICE. The production of these routines is often viewed as a labor-intensive task that is not sufficiently recognized or rewarded within the community, creating a bottleneck in the preservation of collider data.

Methodology: The AgentRivet Workflow
To address this gap, the authors designed and implemented AgentRivet, an autonomous, multi-step workflow based on Large Language Models (LLMs). The system is built as a modular, provider-agnostic Python framework that orchestrates specialized AI agents to extract physics information from journal publications and generate corresponding Rivet routines.
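
One way to picture the provider-agnostic design is a small interface that every provider adapter satisfies, so that agents never depend on a specific vendor. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `LLMClient`, `EchoClient`, `Agent`, and `complete` are hypothetical, and `EchoClient` is a stand-in that makes no network calls.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Hypothetical minimal interface that every provider adapter satisfies."""

    def complete(self, prompt: str) -> str: ...


class EchoClient:
    """Stand-in provider for illustration; a real adapter would call an API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        # Echo the prompt back, tagged with the provider name.
        return f"[{self.name}] {prompt}"


class Agent:
    """An agent holds a role and any client that satisfies LLMClient."""

    def __init__(self, role: str, client: LLMClient) -> None:
        self.role = role
        self.client = client

    def run(self, task: str) -> str:
        return self.client.complete(f"{self.role}: {task}")


# Swapping providers only replaces the client object, not the agent logic.
analyst = Agent("Analyst", EchoClient("provider-a"))
print(analyst.run("extract fiducial cuts"))
analyst.client = EchoClient("provider-b")
print(analyst.run("extract fiducial cuts"))
```

Because the orchestration code only sees the interface, switching models at run time reduces to passing a different client object.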

The workflow consists of the following key components:

Modular Agent Architecture: The system decouples high-level orchestration from specific LLM providers (OpenAI, Anthropic, Google), allowing for dynamic switching between models.
Specialized Agents:
- Analyst: Extracts structured physics information from publications, including fiducial phase-space definitions, object constructions (e.g., dressed leptons, jets), event-selection criteria, and histogram specifications. It utilizes Pydantic models to enforce structured output schemas.
- Coder: Generates Rivet-compatible C++ code based on the structured summary provided by the Analyst. It is constrained to use Rivet4 syntax and adheres to specific revision policies.
- Code Reviewer: Evaluates the generated code for syntax errors, deprecated Rivet3 usage, and potential compile-time issues.
- Physics Reviewer: Validates the physics fidelity of the implementation against the Analyst's extracted specification, checking for inconsistencies in object definitions, cuts, and observables.
Iterative Review Loop: A critical feature of the workflow is an iterative loop where the Coder refines the code based on feedback from both reviewers. This loop continues until approval is granted, no major issues remain, or a configurable iteration limit is reached.
Shared Memory and Artifacts: All intermediate steps, including extracted metadata, code drafts, and review comments, are stored in a shared state. This ensures the process is auditable, reproducible, and allows for the caching of expensive LLM-derived products.
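
The iterative review loop and shared state described above can be sketched roughly as follows. This is an illustrative sketch only: the paper states that the Analyst uses Pydantic models, whereas this example swaps in standard-library dataclasses to stay dependency-free, and every name here (`Review`, `SharedState`, `review_loop`, and the toy agents) is hypothetical. The toy agents are plain functions standing in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Review:
    approved: bool
    major_issues: List[str] = field(default_factory=list)


@dataclass
class SharedState:
    """Shared memory: every draft and review round is kept for auditing."""
    drafts: List[str] = field(default_factory=list)
    reviews: List[List[Review]] = field(default_factory=list)


def review_loop(
    spec: str,
    coder: Callable[[str, List[str]], str],
    reviewers: List[Callable[[str, str], Review]],
    max_iterations: int = 5,
) -> SharedState:
    state = SharedState()
    feedback: List[str] = []
    for _ in range(max_iterations):  # configurable iteration limit
        draft = coder(spec, feedback)
        state.drafts.append(draft)
        round_reviews = [reviewer(draft, spec) for reviewer in reviewers]
        state.reviews.append(round_reviews)
        feedback = [issue for rev in round_reviews for issue in rev.major_issues]
        # Stop when all reviewers approve or no major issues remain.
        if all(rev.approved for rev in round_reviews) or not feedback:
            break
    return state


def toy_coder(spec: str, feedback: List[str]) -> str:
    # Pretend that one round of feedback removes the deprecated call.
    return "rivet4 code" if feedback else "rivet3 code"


def toy_code_reviewer(draft: str, spec: str) -> Review:
    issues = ["deprecated Rivet3 usage"] if "rivet3" in draft else []
    return Review(approved=not issues, major_issues=issues)


def toy_physics_reviewer(draft: str, spec: str) -> Review:
    return Review(approved=True)


state = review_loop("spec", toy_coder, [toy_code_reviewer, toy_physics_reviewer])
print(len(state.drafts), state.drafts[-1])  # prints: 2 rivet4 code
```

Keeping every draft and review in one state object is what makes a run auditable after the fact, and it gives a natural place to cache expensive outputs.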

Benchmarking and Experimental Setup
The performance of AgentRivet was evaluated using two recent, publicly available measurements that lacked existing Rivet routines:

ATLAS: Inclusive $W\gamma \to \ell\nu\gamma$ production, featuring complex angular observables, boost asymmetries, and neural network-based observables.
CMS: Event shape observables using charged particles inside jets, involving non-trivial definitions of jet mass, thrust, and broadening.

The system was tested using three commercial LLMs: Gpt-5.5 (OpenAI), Gemini-3.5-Flash (Google), and Claude-Opus-4.6 (Anthropic). For each setup, three independent runs were performed to assess consistency. The generated routines were compiled using Rivet-4.1.2 and applied to Monte Carlo event samples (MadGraph5_aMC@NLO and Pythia8) to verify physics outputs.

Key Results

Code Quality: AgentRivet produced competent Rivet routines with few syntax errors.
- Gpt-5.5 and Claude-Opus-4.6 generally produced routines that compiled successfully, though Claude-Opus-4.6 rarely gave formal approval to a routine even when it identified zero blockers.
- Gemini-3.5-Flash required 2–3 iterations to remove deprecated Rivet3 syntax and occasionally introduced hallucinated syntax.
- All routines could be compiled with minimal human intervention (fixing only necessary errors).
Physics Fidelity:
- Object Reconstruction: Most models correctly reconstructed standard objects (electrons, muons, photons, jets). However, subtle issues arose, such as the incorrect exclusion of "dressed" leptons or the inclusion of prompt neutrinos in jet finding, often due to ambiguous phrasing in the source papers.
- Complex Observables: The system struggled with the most complex definitions. For the ATLAS analysis, Gemini-3.5-Flash failed to construct angular observables entirely due to incomplete information extraction by the Analyst. Claude-Opus-4.6 occasionally applied constraints to the wrong system (e.g., constraining the $\ell\nu\gamma$ system instead of $\ell\nu$).
- Neural Network Observables: As expected, no model could construct neural network-based observables without the underlying model files, highlighting a limitation in handling "black box" definitions.
- Histogram Binning: When HepData records were unavailable, models had to infer binning from plots, leading to slight mismatches that required manual correction.
Cost and Reliability: The cost to produce a routine ranged from $1.20 to $2.20. The framework demonstrated robustness against API failures through retry logic, though access stability varied significantly by provider and time of day.
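
The summary does not say how the retry logic is implemented. A common pattern for transient API failures is retry with exponential backoff, sketched below under that assumption. The function name `call_with_retry`, its parameters, and the choice of `ConnectionError` as the retryable error are all hypothetical, and the flaky function only simulates an outage.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ConnectionError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ...
    raise AssertionError("unreachable")


calls: Dict[str, int] = {"n": 0}


def flaky() -> str:
    # Simulated API call that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated API outage")
    return "ok"


delays: List[float] = []
result = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # prints: ok [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the example fast to run and makes the backoff schedule easy to inspect.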

Significance and Claims
The paper claims that AgentRivet demonstrates the capability of modern LLMs to extract detailed analysis definitions from scientific literature and translate them into executable scientific software. The system successfully bridges the gap between publication and implementation, offering a potential solution to the incomplete coverage of Rivet routines.

The authors emphasize that the iterative review process is essential for improving both code quality and consistency with the original analysis. They note that while the system is not yet perfect, the majority of physics-implementation issues stem from subtle but ambiguous definitions in the original publications rather than fundamental flaws in the workflow. Consequently, the paper argues that AgentRivet provides a viable, automated pathway to increase analysis preservation, provided that the generated artifacts undergo the described quality control loops. The work contributes to the growing literature on AI agents by documenting their performance in a rigorous, domain-specific scientific context.

AgentRivet: an automated system for producing Rivet routines from journal publications
