Development of an LLM-Based System for Automatic Code Generation from HEP Publications

This paper presents a proof-of-concept system that uses open-weight large language models to extract analysis procedures from high-energy physics publications and generate executable code for reproducing their results. The system shows promise as a human-in-the-loop tool, but current limitations such as hallucination and execution failures remain.

Original authors: Masahiko Saito, Tomoe Kishimoto, Junichi Tanaka

Published 2026-04-17

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to recreate a famous, delicious cake that a master baker published in a magazine. The recipe isn't just a simple list of ingredients; it's a long, complex article that says things like, "Use the flour described in the 2015 baking guide" or "Whisk until the texture matches the description in the 2018 journal."

To make this cake, you need to:

  1. Read the magazine article and all the other books it references.
  2. Write down a clear, step-by-step shopping list and instruction manual.
  3. Actually bake the cake in your kitchen and see if it tastes exactly like the one in the picture.

This paper is about building a super-smart robot assistant (an AI) to do this for scientists in High Energy Physics (HEP). These scientists study the smallest particles in the universe, and their "recipes" are incredibly complex analysis procedures, described in scientific papers and carried out by computer programs.

Here is how the authors tried to teach this robot to work, broken down into simple concepts:

The Problem: The "Black Box" of Science

In the past, if a scientist wanted to check if a famous experiment was done correctly, they had to read a 20-page paper and manually rewrite the computer code from scratch. It's like trying to rebuild a Ferrari engine just by reading a magazine article about it. It takes years, and it's easy to make a mistake.

The authors wanted to use Large Language Models (LLMs)—the same type of AI that writes emails or chats with you—to read these papers and automatically write the computer code needed to recreate the experiment.

The Solution: A Two-Step "Translator" Robot

The authors realized that asking an AI to "Read this paper and write the code" is like asking a child to translate a novel into a movie script in one go. It's too much pressure, and the AI might start making things up (a problem called "hallucination").

Instead, they built a two-stage assembly line:

Stage 1: The "Note-Taker" (Extraction)

First, the AI acts like a very organized student. It reads the main paper and all the other papers it mentions. Its job is to pull out the specific rules (like "only keep particles that are heavier than X") and write them down in a neat, structured list.

  • The Analogy: Imagine the AI is a detective reading a mystery novel. Instead of writing the whole story, it just writes down a list of clues: "The butler was in the library," "The candle was blue," etc.
  • The Result: The AI, especially the larger models, got pretty good at finding the clues. However, it sometimes got confused or invented clues that weren't there.
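
To make this concrete, here is a minimal sketch of what Stage 1 could look like in code. It assumes an open-weight model served behind an OpenAI-compatible endpoint (for example via vLLM); the endpoint, prompt, JSON schema, and model name are all illustrative stand-ins, not the authors' actual setup:

```python
# Minimal sketch of the extraction stage (Stage 1): ask the model to turn
# free-form paper text into a structured list of selection rules.
# Endpoint, prompt, schema, and model name are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

EXTRACTION_PROMPT = """\
You are extracting event-selection rules from a physics paper.
Return ONLY a JSON list of objects with these keys:
  "object"   (e.g. "jet", "electron"),
  "variable" (e.g. "pt", "eta"),
  "operator" (one of ">", "<", ">=", "<="),
  "value"    (a number, in the paper's units).
Paper text:
{paper_text}
"""

def extract_cuts(paper_text: str) -> list[dict]:
    response = client.chat.completions.create(
        model="my-open-weight-model",  # placeholder name
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(paper_text=paper_text)}],
        temperature=0,  # reduce run-to-run variation
    )
    # The model may still return malformed or invented rules ("hallucination"),
    # so a real pipeline would validate against a schema and retry on failure.
    return json.loads(response.choices[0].message.content)

cuts = extract_cuts("Jets are required to have pT > 25 GeV.")
print(cuts)  # e.g. [{"object": "jet", "variable": "pt", "operator": ">", "value": 25}]
```

The key idea is that the model's only job at this stage is note-taking: it produces a checkable list of rules, not runnable code.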

Stage 2: The "Chef" (Code Generation)

Once the AI has the neat list of rules, it moves to the second stage. Now, it acts like a chef trying to cook the dish based only on that list. It writes the actual computer code, runs it, and checks if the result matches the original experiment.

  • The Analogy: The AI takes the detective's list of clues and tries to build a Lego castle that looks exactly like the one in the photo.
  • The Result: Sometimes, the AI built a perfect castle. But often, it built a wobbly tower that fell over, or a castle that looked right but had the wrong number of windows.
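
Continuing the sketch, Stage 2 could turn that structured list into an executable event selection and count how many events survive, to be compared against the published result. The toy data layout and column names below are assumptions for illustration, not the authors' pipeline:

```python
# Minimal sketch of the generation/validation stage (Stage 2): apply the
# extracted cuts to event data and count how many events survive.
# The data layout and the cuts themselves are toy placeholders.
import operator
import numpy as np

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def apply_cuts(events: dict[str, np.ndarray], cuts: list[dict]) -> np.ndarray:
    """Return a boolean mask of events passing every extracted cut."""
    n_events = len(next(iter(events.values())))
    mask = np.ones(n_events, dtype=bool)
    for cut in cuts:
        column = f'{cut["object"]}_{cut["variable"]}'  # e.g. "jet_pt"
        mask &= OPS[cut["operator"]](events[column], cut["value"])
    return mask

# Toy events: one value per event (real analyses are far richer, with
# absolute values, units, and per-object collections to handle).
events = {"jet_pt": np.array([30.0, 10.0, 50.0]),
          "jet_eta": np.array([0.5, 1.0, 3.0])}
cuts = [{"object": "jet", "variable": "pt", "operator": ">", "value": 25},
        {"object": "jet", "variable": "eta", "operator": "<", "value": 2.5}]

n_selected = int(apply_cuts(events, cuts).sum())
print(f"{n_selected} event(s) pass")  # compare against the paper's published yield
```

Splitting the work this way also means a human can inspect the extracted rules before any code runs, catching invented rules early.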

The Big Challenges

The authors found three main things that make this robot not quite ready for prime time:

  1. The "Daydreaming" Problem (Hallucination): Sometimes the AI is so confident that it invents facts. It might say, "The paper said to use a red hammer," when the paper actually said "blue." In science, a red hammer ruins the whole experiment.
  2. The "Mood Swing" Problem (Stochasticity): If you ask the AI to do the same task twice, it might give you two different answers. One time it gets it right; the next time, it fails. This makes it hard to trust.
  3. The "Running Out of Breath" Problem: The papers are so long and complex that the AI sometimes forgets the beginning of the sentence by the time it gets to the end.

The Verdict: A Helpful Assistant, Not a Boss

The authors conclude that these AI robots are not yet ready to work alone. You cannot just let them run the experiment and hope for the best.

However, they are amazing "Co-Pilots."

  • The Human-in-the-Loop: The best way to use this system is for a human scientist to sit next to the robot. The robot does the heavy lifting (reading 50 pages and writing 100 lines of code), and the human checks the work.
  • The Safety Net: If the robot makes a mistake, the human catches it. If the robot gets stuck, the human helps it out.

Why This Matters

If this system gets better, it could change science forever. It would mean that:

  • New students could understand complex physics papers much faster.
  • Old experiments could be re-checked easily to make sure no mistakes were made years ago.
  • Science becomes more transparent, because the "recipe" is automatically checked against the "dish."

In short, the authors built a prototype robot that can read science papers and try to recreate the experiments. It's not perfect yet—it still daydreams and makes mistakes—but with a human friend looking over its shoulder, it's a powerful tool for making science more reliable and accessible.
