⚛️ phenomenology

CoLLM: AI engineering toolbox for end-to-end deep learning in collider analyses

CoLLM is an AI engineering toolbox that leverages pretrained large language models and a graphical user interface to automate the generation of physically consistent event selection code and deep learning analyses, thereby lowering the programming and technical barriers for end-to-end collider analyses.

Original authors: W. Esmail, A. Hammad, M. Nojiri

Published 2026-02-09

📖 4 min read🧠 Deep dive

CC BY 4.0

Original authors: W. Esmail, A. Hammad, M. Nojiri

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ✨ This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper. Read full disclaimer

Imagine you are a master chef (a particle physicist) who has a brilliant idea for a new dish (a scientific experiment at the Large Hadron Collider). You know exactly what flavors you want and how the ingredients should interact. However, to actually cook this dish, you have to spend hours writing a complex, line-by-line recipe in a language only a computer understands (Python code). If you make a single typo—like confusing salt for sugar—the whole dish is ruined, and you might not even notice until you taste the final result.

CoLLM is like a super-smart, specialized sous-chef who speaks both "Chef" (physics) and "Computer" (code) fluently. It takes your idea in plain English and instantly writes the perfect, error-free recipe for you, then even cooks the dish and serves it up.

Here is how CoLLM works, broken down into simple steps:

1. The "Vibe Engineering" Chef's Assistant

Usually, when people use AI to write code, they just ask for a recipe and hope for the best. This is called "vibe coding." But in science, a wrong ingredient can ruin years of work. CoLLM uses a stricter approach called "vibe engineering."

The Prompt (The Rulebook): Before the AI writes a single line of code, it is given a massive, detailed "rulebook" (a system prompt). This rulebook contains all the laws of physics, the specific way particle data is stored, and the golden rules of cooking in a collider lab. It tells the AI, "Never mix up these numbers," and "Always measure this ingredient this way."
The Translation: You type your experiment in plain English: "I want to find particles that look like this, ignore those, and measure the energy of the leftovers." The AI, guided by the rulebook, translates this into a perfect Python script.

2. The Self-Correcting Taste Test

Even the best chefs make mistakes. If the AI writes a line of code that crashes the computer (like trying to chop a rock instead of an onion), CoLLM doesn't just give up.

The Loop: It runs the code. If it breaks, the AI reads the error message, realizes, "Oh, I forgot to put a comma there," and fixes only that specific part. It tries again. It keeps doing this until the code runs perfectly. It's like a robot that keeps tasting the soup and adding a pinch of salt until it's just right, without you having to lift a spoon.

3. The Automatic Tasting Panel (Deep Learning)

Once the recipe is written and the ingredients are prepped, the next step is usually to train a computer to recognize the "flavor" of the signal (the interesting particles) versus the background noise (the boring stuff).

The Magic Box: CoLLM doesn't stop at writing the recipe. It automatically takes the prepared data and feeds it into three different types of "tasting machines" (Deep Learning models):
- MLP: A simple, fast taster for standard data.
- GNN: A smart taster that understands how particles are connected to each other, like a social network of ingredients.
- Transformer: A super-taster that looks at the whole picture at once, understanding long-range relationships between particles.
The Result: It trains these models, checks how well they work, and gives you a report card with graphs showing exactly how good the model is at finding the "needle in the haystack."

4. The User Interface: Two Ways to Order

CoLLM is designed to be friendly to everyone, whether you are a tech wizard or just want to get things done.

The Terminal (TUI): For the pros who like to type commands and run scripts in the background.
The Graphical Interface (GUI): A colorful, clickable website where you can type your idea, hit a button, and watch the AI work in real-time, showing you the graphs as they are drawn.

Why is this a big deal?

In the past, a physicist had to be a master coder, a data scientist, and a particle expert all at once. If you were great at physics but bad at coding, you were stuck.

CoLLM acts as a universal translator. It lowers the barrier to entry, allowing scientists to focus on the physics (the "what" and "why") rather than the coding (the "how"). It ensures that the code is not just written, but is physically correct, reproducible (you get the same result every time), and automatically validated.

In short: CoLLM is a tool that lets you describe a complex particle physics experiment in plain English, and it automatically writes the code, fixes its own mistakes, and trains a smart AI to find the answer, all without you needing to be a coding expert.

Technical Summary: CoLLM – AI Engineering Toolbox for End-to-End Deep Learning in Collider Analyses

1. Problem Statement

Modern collider analyses at the Large Hadron Collider (LHC) face a dual challenge: increasing data volumes and escalating analytical complexity. A typical analysis requires translating high-level physics concepts (e.g., object reconstruction, event selection, kinematic observable computation) into executable code, followed by the implementation of deep learning pipelines for signal-background classification. This translation process is time-consuming, prone to transcription errors (such as incorrect particle identification codes or inconsistent kinematic cuts), and demands expertise in both particle physics and software engineering.

While Large Language Models (LLMs) have shown promise in accelerating scientific workflows, their direct application to full collider analysis pipelines is limited. Generic LLMs lack embedded knowledge of high-energy physics (HEP) conventions, cannot natively execute or validate the code they generate, and produce non-deterministic outputs that compromise reproducibility. Furthermore, the "vibe coding" approach (relying on AI-generated code without rigorous review) is risky in physics where correctness is paramount.

2. Methodology: The CoLLM Framework

CoLLM is an open-source Python framework designed to bridge the gap between natural language analysis specifications and trained deep learning classifiers. It operates as an end-to-end pipeline consisting of two tightly integrated components:

2.1 LLM-Based Code Generation Engine

The first stage translates plain language specifications into validated Python code for event preselection and feature extraction.

Structured Input: User inputs are organized into three semantic sections: Selection Cuts (object multiplicities, kinematic constraints), Validation Plots (diagnostic distributions), and Output Structure (observables for deep learning).
Physics-Aware System Prompt: To mitigate the lack of domain knowledge in generic models, CoLLM employs a comprehensive system prompt. This prompt encodes:
- The LHCO (LHC Olympics) data format specifications.
- Standard particle identification codes (e.g., type 6 for MET).
- Kinematic formulas (e.g., invariant mass, transverse mass) with explicit warnings against common LLM errors (e.g., summing vs. subtracting 4-momenta).
- Reference helper functions for parsing and object selection.
Deterministic Decoding: To ensure reproducibility, the primary generation model uses a temperature of $T=0$ with greedy decoding, making the output a deterministic function of the input prompt.
Automatic Error Correction (PyFixer): A secondary LLM, operating in an exploratory mode ( $T=0.9$ ), iteratively repairs execution failures. It analyzes tracebacks and modifies only the faulty code segments rather than regenerating the entire script, preserving validated logic.

2.2 Automated Deep Learning Pipeline

The second stage consumes the features extracted by the generated code to train signal-background classifiers. The framework supports three architectures, configurable via YAML or a Graphical User Interface (GUI):

Multi-Layer Perceptrons (MLPs): For fixed-length, high-level kinematic feature vectors.
Graph Neural Networks (GNNs): For variable-multiplicity particle sets (e.g., jets, tracks), treating particles as nodes and relations as edges. Supports Graph Convolutional Networks (GCNs), Dynamic Edge Convolution (EdgeConv), and Graph Attention Networks (GATs).
Transformer Networks: For particle cloud representations using self-attention mechanisms to model long-range dependencies without fixed topology.

The pipeline automates data loading, normalization, model construction, training (with callbacks for early stopping, learning rate scheduling, and mixed precision), and evaluation using standard HEP metrics (e.g., AUC).

2.3 User Interfaces

CoLLM provides two interfaces:

Terminal User Interface (TUI): Uses YAML configuration files for batch processing and reproducible workflows.
Graphical User Interface (GUI): A Streamlit-based web interface for interactive configuration, real-time monitoring, and visual debugging.

3. Key Contributions

End-to-End Automation: CoLLM provides a unified workflow from natural language physics specifications to trained deep learning classifiers, reducing the manual coding burden.
Physics-Aware Generation: Unlike generic code generators, CoLLM embeds HEP conventions directly into the generation context via a specialized system prompt, ensuring physical consistency in kinematic calculations and object handling.
Deterministic Reproducibility: By enforcing $T=0$ decoding for the primary generator and utilizing a structured error correction loop, CoLLM addresses the non-determinism inherent in standard LLM applications.
Modular Deep Learning Integration: The framework seamlessly integrates three distinct neural network families (MLP, GNN, Transformer) tailored to different collider event representations.
Validation and Benchmarking: The authors provide a systematic validation study using five benchmark processes ( $pp \to W^+W^-$ , $t\bar{t}$ , $H \to \gamma\gamma$ , $WZ$, $Hjj$) to demonstrate the framework's ability to generate correct selection logic and diagnostic plots.

4. Results

The paper validates CoLLM using the meta-llama/Llama-3.3-70B-Instruct model on five benchmark analyses.

Code Correctness: The framework successfully generated executable Python scripts for complex semi-leptonic top-quark pair production and other processes, correctly parsing LHCO files, applying selection cuts, and computing kinematic variables.
Reproducibility: In repeated runs with identical inputs, the framework produced consistent cutflow results. Minor variations observed were attributed to ambiguities in the user prompt (e.g., the definition of "leading jets") rather than model stochasticity, highlighting the importance of precise user specifications.
Physics Validation: Generated histograms (e.g., dijet invariant mass, transverse mass) exhibited expected physical features, such as peaks near the $W$ boson and top quark masses, and Jacobian edges for $W \to \ell\nu$ decays.
Error Correction: The PyFixer module resolved the majority of execution errors within one or two refinement iterations, demonstrating the efficacy of the iterative repair mechanism.

5. Significance and Claims

The authors position CoLLM not as a replacement for physicist expertise, but as a tool for "vibe engineering"—a disciplined approach where LLMs assist in code generation while the framework enforces strict validation and physics constraints.

Lowering the Barrier: CoLLM aims to simplify the technical complexity of collider analyses, making sophisticated event selections and deep learning methods accessible to physicists who may lack extensive programming experience.
Reliability over Speed: The paper emphasizes that while generic LLMs are useful for auxiliary tasks, they fail to meet the rigorous requirements of collider physics due to a lack of domain knowledge and reproducibility. CoLLM addresses this by integrating domain-specific prompts and automated validation loops.
Current Limitations: The authors modestly acknowledge current constraints:
- Code generation is currently restricted to the LHCO text format and does not yet support the ROOT data format widely used in experimental analyses.
- Ambiguities in natural language inputs can still lead to variations in generated code, requiring users to be precise in their specifications.
- The framework relies on the availability of specific LLMs and computational resources (GPUs) for local inference, though it supports cloud API alternatives.

In conclusion, CoLLM represents a significant step toward automating the technical execution of collider analyses, ensuring that the resulting code is not only syntactically correct but also physically consistent and reproducible.