An AI-ready, Polarized Electron-Positron Collision… — Plain-Language Explanation

Original authors: Chi Lung Cheng, Simon Corrodi, T. J. Hobbs, Alaettin Serhan Mete, Benjamin Nachman

Published 2026-06-02

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ✨ This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine a massive, high-tech library from the 1990s that holds the blueprints and logs of a very special experiment. This experiment, called SLD, was like a "Z-factory," smashing electrons and positrons together to create a particle called the Z boson. What made this factory unique was that the electron beams were "polarized": think of them as spinning tops all turning in the same direction. This allowed scientists to measure certain quantities more precisely than other colliders could.

However, for decades, the data from this factory was locked away in a digital vault. The files were written in an ancient, obscure language (a mix of old Fortran code and binary formats) that modern computers couldn't read, and the "keys" to open them (the original software and documentation) had been lost or scattered.

This paper is the story of how a team of scientists used modern Artificial Intelligence (AI) to break into that vault, translate the ancient language, and open the doors for everyone.

Here is a breakdown of what they did, using simple analogies:

1. The "Time Capsule" Data

The team released about 660,000 reconstructed events (snapshots of particle collisions) recorded between 1996 and 1998.

The Problem: These files were like a cassette tape in a language no one speaks anymore. The original software to read them was gone, and the documentation was just piles of paper in an archive.
The AI Solution: They used AI agents (specifically, a tool called "Claude") to act as a digital archaeologist. The AI looked at the raw binary data (the 1s and 0s) and compared it against known physics laws (like a detective checking a suspect's alibi against the crime scene).
- Analogy: Imagine finding a locked box with no key. Instead of breaking it, you look at the scratches on the box, guess what's inside based on the weight, and then use a smart assistant to figure out the combination lock code. The AI helped them reverse-engineer the code to read the data.
The Result: They built a new, open-source tool called jazelle that translates these ancient files into modern, easy-to-use formats (like Parquet) that any data scientist can now use.

2. The "Lost Library" of Documentation

Along with the data, they digitized about 1,190 internal documents.

The Problem: These were physical papers, many of them photocopies of photocopies, with handwritten notes, messy diagrams, and typed text all mixed together. Standard scanners often fail on this kind of "messy" paper.
The AI Solution: They tested four different AI tools to read these documents.
- Analogy: It's like trying to read a handwritten recipe card that has coffee stains and doodles on it. Some AI tools tried to turn the handwriting into text but got confused by the grid lines on the paper. Others were great at reading tables but failed at math equations.
- They found that by combining the best tools, they could turn these messy pages into searchable text. They even built an AI "Librarian" (a question-answering system) that can read these documents and answer specific questions, like "What was the clock speed of the microprocessor used in 1995?"

3. Proving It Works (The "Test Drive")

Before handing over the keys, the team had to prove the data was accurate. They didn't just guess; they ran a "test drive."

The Test: They took the newly translated data and ran the same kinds of physics calculations the original scientists did decades ago.
The Result: The numbers matched. They successfully recreated the famous measurements of the "weak mixing angle" (a fundamental property of the universe) using the new data. This proved that the AI translation didn't break anything; it just made the data readable again.

4. Why This Matters for AI Research

The paper highlights that this dataset is a unique training ground for modern Artificial Intelligence.

The Gap: Most AI models in physics are trained on data from proton-proton collisions (like at the Large Hadron Collider), which are messy and chaotic.
The SLD Difference: The SLD data is "clean" and the initial conditions are perfectly known.
The "New Territory": The researchers tested a modern AI model (called OmniLearned) on this data. They found that the SLD data occupies a completely different "neighborhood" in the AI's brain (latent space) compared to other datasets.
- Analogy: If you train a dog to fetch a ball in a park, it might get confused if you suddenly ask it to fetch a ball in a swimming pool. This dataset is the "swimming pool" that current AI models have never seen. By releasing it, the team is giving AI researchers a new, unique environment to learn from, which could help them build better, more versatile models.

Summary

In short, this paper is about resurrecting a lost scientific treasure. The team used AI to translate ancient, unreadable data and messy paper notes into a modern, usable format. They proved the translation is accurate by re-running old physics experiments, and they showed that this unique data offers a fresh, clean playground for training the next generation of AI models in particle physics.

Technical Summary: An AI-ready, Polarized Electron-Positron Collision Dataset

Problem Statement
Despite the lasting physics impact of the SLD experiment at the SLAC Linear Collider (SLC), its reconstructed data from the 1996–1998 run (approximately 660,000 events) remained inaccessible to modern analysis tools. The data existed in legacy "Jazelle" binary formats, decoded by software written in Mortran (a Fortran extension) that is no longer operational on modern systems. Furthermore, the proprietary and poorly documented ecosystem meant that critical data structures, such as the per-event electron-beam polarization bank (PHBM), were effectively lost. This inaccessibility represents a bottleneck for machine learning (ML) in particle physics, which currently relies heavily on proton-proton collision data (LHC) and lacks diverse, high-quality datasets from the $e^+e^-$ regime, particularly those featuring known initial-state polarization. Additionally, the institutional knowledge required to interpret these legacy datasets resides in physical internal notes that were never digitized.

Methodology
The authors executed a two-pronged modernization effort involving data reconstruction and documentation digitization:

Data Reconstruction and Translation:
- Reverse Engineering: The team reverse-engineered the binary Jazelle format using AI assistance (specifically Anthropic's Claude). They combined partial legacy documentation with "physics-based ground truth" (e.g., kinematic constraints of $Z \to q\bar{q}$ decays) to identify candidate field positions and data types within the binary banks.
- The jazelle Toolkit: An open-source Python package was developed to read the legacy binaries and emit Awkward record arrays. These are serialized into modern, columnar formats (Parquet, HDF5, Feather).
- Scope: The release covers the 1996–1998 runs. It includes event headers, beam information (including polarization), charged tracks, calorimeter clusters, particle identification subsystems, and relational tables. It applies standard data-quality requirements but no specific channel selection.
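The reverse-engineering idea above can be illustrated with a toy sketch. It is not the authors' procedure and not the jazelle API: the record layout, field meanings, byte order, and plausibility window below are all invented for illustration. The approach is to decode every candidate (offset, type) pair in a set of fixed-length binary records and keep only those whose values satisfy a physics prior in every record, here "the energy of a $Z$-pole event should be some tens of GeV".

```python
import struct

# Hypothetical fixed-length record: run number (int32), event energy in GeV
# (float32), track count (int16), padding (int16). Big-endian is an assumption.
RECORD_FORMAT = ">ifhh"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 12 bytes, no alignment padding

def make_toy_records():
    """Build a few synthetic records standing in for an undocumented bank."""
    rows = [(35001, 88.5, 21, 0), (35001, 91.0, 18, 0), (35002, 79.3, 25, 0)]
    return [struct.pack(RECORD_FORMAT, *row) for row in rows]

def plausible_energy(value):
    """Physics prior: accept values between 20 and 120 (GeV); reject NaN."""
    return value == value and 20.0 <= value <= 120.0

def find_candidate_fields(records, predicate, type_codes=("f", "i", "h")):
    """Return (offset, type_code) pairs whose value passes in every record."""
    candidates = []
    for code in type_codes:
        width = struct.calcsize(">" + code)
        for offset in range(0, RECORD_SIZE - width + 1):
            values = [struct.unpack_from(">" + code, rec, offset)[0]
                      for rec in records]
            if all(predicate(v) for v in values):
                candidates.append((offset, code))
    return candidates

records = make_toy_records()
candidates = find_candidate_fields(records, plausible_energy)
print(candidates)  # the true energy field, (4, "f"), is among the survivors
```

With only a handful of records, spurious candidates can survive a single loose cut. Adding more records and more constraints (for example, requiring a conserved total or a sensible distribution shape) is what narrows the list to one interpretation.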
Documentation Digitization and AI-Readiness:
- Corpus: Approximately 1,190 internal SLD/SLC notes (mostly from 1980–1988) were scanned from physical archives.
- Extraction Pipeline: Four tools were evaluated for text extraction: Marker, Docling, Nougat (open-weight models), and the Azure AI Document Intelligence API. The pipeline handles heterogeneous inputs, including typewritten notes, photocopies, hand-drawn figures, and complex tables.
- Agentic Workflow: The extracted text was indexed using hybrid retrieval (dense embeddings + keyword search). An agentic question-answering system was built to demonstrate the corpus's utility, utilizing a Model Context Protocol (MCP) server for iterative retrieval and reasoning.
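Hybrid retrieval as described above can be sketched in plain Python. The summary does not say how the dense and keyword rankings are combined, so reciprocal rank fusion (RRF) is used here as one common choice. The toy corpus, the character-trigram stand-in for a learned embedding, and every function name are illustrative assumptions, not the authors' pipeline.

```python
import math
from collections import Counter

# Toy corpus standing in for extracted note text (contents invented).
DOCS = {
    "note_a": "compton polarimeter measures electron beam polarization",
    "note_b": "calorimeter cluster energy calibration procedure",
    "note_c": "vertex detector alignment and tracking resolution",
}

def tokenize(text):
    return text.lower().split()

def keyword_ranking(query, docs):
    """Rank documents by how many of their tokens are query terms."""
    terms = set(tokenize(query))
    scores = {doc_id: sum(1 for tok in tokenize(text) if tok in terms)
              for doc_id, text in docs.items()}
    return sorted(docs, key=lambda d: (-scores[d], d))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def trigrams(text):
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def dense_ranking(query, docs):
    """Stand-in for a dense retriever: cosine similarity of trigram counts.
    A real system would use a learned embedding model here."""
    q = trigrams(query.lower())
    sims = {doc_id: cosine(q, trigrams(text)) for doc_id, text in docs.items()}
    return sorted(docs, key=lambda d: (-sims[d], d))

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))

query = "beam polarization measurement"
results = reciprocal_rank_fusion(
    [keyword_ranking(query, DOCS), dense_ranking(query, DOCS)]
)
print(results[0])
```

RRF only needs ranks, not comparable scores, which is why it is a convenient way to merge a keyword scorer with an embedding-based one. An agentic loop would call a function like this repeatedly, rewriting `query` between calls.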

Key Results

Physics Validation: The authors reproduced canonical SLD measurements on the translated dataset to validate internal consistency:
- Kinematic Distributions: Reconstructed visible mass spectra and event-shape variables ($\tau$) matched expected $Z$-pole physics (e.g., back-to-back two-jet topology).
- Asymmetry Measurements: The left-right cross-section asymmetry ($A_{LR}$) and leptonic coupling asymmetries ($A_\ell$) were extracted via event counting. The derived effective weak mixing angle ($\sin^2\theta_W^{\text{eff}} = 0.23144 \pm 0.00044$ from $A_{LR}$) aligns with published values, confirming the dataset preserves polarization-sensitive content.
- Limitations: The authors note that raw $A_{LR}$ values differ slightly from published results because the released dataset lacks the specific electroweak correction software (ZFITTER) used in the original analysis. Similarly, leptonic channel counts show minor discrepancies due to unavailable original selection software.
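The counting extraction and its link to the mixing angle can be sketched in a few lines. The relations used are the standard ones: $A_{LR} = (N_L - N_R) / [(N_L + N_R)\,P_e]$ for counts $N_L$, $N_R$ with left- and right-handed beams and polarization $P_e$, and $A_{LR} = 2x/(1 + x^2)$ with $x = 1 - 4\sin^2\theta_W^{\text{eff}}$. The event counts and polarization below are made-up illustrative numbers, not SLD data, and no radiative or electroweak corrections are applied.

```python
import math

def measured_alr(n_left, n_right, polarization):
    """Left-right asymmetry from event counts, scaled by beam polarization."""
    raw = (n_left - n_right) / (n_left + n_right)
    return raw / polarization

def alr_stat_error(n_left, n_right, polarization):
    """Binomial statistical uncertainty on A_LR (polarization taken as exact)."""
    n_total = n_left + n_right
    raw = (n_left - n_right) / n_total
    return math.sqrt((1.0 - raw * raw) / n_total) / polarization

def alr_from_sin2(sin2_eff):
    """A_LR = 2x / (1 + x^2), with x = 1 - 4 sin^2(theta_eff)."""
    x = 1.0 - 4.0 * sin2_eff
    return 2.0 * x / (1.0 + x * x)

def sin2_from_alr(alr):
    """Invert the relation above, taking the root with |x| < 1 (alr != 0)."""
    x = (1.0 - math.sqrt(1.0 - alr * alr)) / alr
    return (1.0 - x) / 4.0

# Illustrative inputs only: hypothetical counts and polarization.
n_left, n_right, pol = 56_000, 44_000, 0.75
alr = measured_alr(n_left, n_right, pol)
print(f"A_LR = {alr:.4f} +/- {alr_stat_error(n_left, n_right, pol):.4f}")
print(f"sin^2(theta_eff) = {sin2_from_alr(alr):.5f}")
```

Because $x$ is small near the measured mixing angle, a modest relative error on $A_{LR}$ maps onto a much smaller absolute error on $\sin^2\theta_W^{\text{eff}}$, which is why the polarized asymmetry is such a sensitive probe.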
ML Demonstration: Using the OmniLearned foundation model, the authors embedded SLD jets alongside jets from ALEPH ($e^+e^-$), H1 ($ep$), and JetClass ($pp$). t-SNE projection revealed that SLD data occupies a distinct region in the latent space, separated by initial state and energy scale. Crucially, as the only reconstructed detector data in the comparison, it represents a regime (polarized $e^+e^-$ at the $Z$ pole) not captured by current public MC simulations.
Documentation Performance: An agentic QA system achieved near-saturation task completion (60/61 questions) on a self-generated benchmark by iteratively reformulating queries. This demonstrated that the digitized corpus supports complex, multi-step scientific exploration, outperforming single-pass RAG baselines.

Significance and Claims
The paper claims this release serves three primary purposes:

Preservation: It saves a unique dataset from the only high-energy linear $e^+e^-$ collider with polarized beams, a configuration that no collider has replicated since.
ML Benchmarking: It provides a clean, well-understood environment with known initial states and polarization to complement the dominant hadron-collider datasets in ML research. The distinct latent space of SLD data offers a new testbed for transfer learning and domain-shift benchmarks.
New Physics Potential: The dataset enables new analyses leveraging modern ML and theoretical advances that were not possible during the original SLD operation.

The authors emphasize that the dataset is a "faithful starting point" for analyses that supply missing radiative corrections and systematic treatments, rather than a re-derivation of final published results. The work also illustrates a broader pattern: legacy datasets with lost software can be recovered by combining surviving documentation, physics constraints, and modern AI tools.

An AI-ready, Polarized Electron-Positron Collision Dataset