NuBench: An Open Benchmark for Deep Learning-Based Event Reconstruction in Neutrino Telescopes

Imagine you are trying to solve a massive, three-dimensional puzzle in the middle of a pitch-black ocean. You can't see the pieces, and you can't see the picture. All you have are a few scattered, glowing fireflies that blink at specific times. Your job? To figure out exactly where the fireflies came from, how fast they were moving, and what kind of creature they were.

This is essentially what scientists do with neutrino telescopes. Neutrinos are ghost-like particles that zip through the Earth almost without touching anything. When one does hit something deep underwater or under the ice, the charged particles produced in the collision give off a faint flash of blue light (Cherenkov radiation). Detectors like IceCube (in Antarctica) or KM3NeT (in the Mediterranean) are huge grids of light sensors waiting to catch these flashes.

The problem is that figuring out what caused the flash is incredibly hard. It's like trying to guess the shape, speed, and origin of a car just by hearing the sound of its horn echo off a canyon wall.

The Problem: Too Many Puzzles, Too Few Rules

For a long time, every team building a neutrino telescope (IceCube, KM3NeT, Baikal-GVD, etc.) has been trying to solve this puzzle in its own way. Each has its own secret recipes (algorithms) and its own private datasets. It's as if every chef in the world were trying to make the perfect soup but refused to share recipes or ingredients. This makes it hard to know who is actually making the best soup.

The Solution: NuBench (The "Neutrino Gym")

This paper introduces NuBench, which is essentially a giant, open-source "gym" for artificial intelligence (AI) to train on.

Think of NuBench as a massive, shared video game level that anyone can download. It contains nearly 130 million simulated neutrino events.

The Simulations: Instead of waiting for real neutrinos (which are rare), the scientists used a supercomputer to simulate what would happen if neutrinos hit six different types of detectors.
The Detectors: They didn't just copy one detector. They built digital versions of six different "layouts," ranging from dense forests of sensors (like a crowded city) to sparse deserts of sensors (like a wide-open field). This tests if the AI can learn the rules of physics or if it just memorized the specific layout of one detector.
The Data: For every simulated event, the AI gets the "raw data" (the timing and brightness of the light flashes) and the "answer key" (the true energy, direction, and type of the neutrino).

The Challenge: The AI Olympics

The authors didn't just build the gym; they put four different AI "athletes" in it to see who could solve the puzzles best. These athletes are:

ParticleNet & DynEdge: The current champions used by real-world experiments. They are like expert detectives who look at the clues locally (neighbor by neighbor).
DeepIce: A winner from a previous public contest. It uses a "Transformer" architecture (the same tech behind modern chatbots) that looks at the whole picture at once.
GRIT: A new hybrid athlete that tries to combine the best of both worlds.

They tested these AIs on five different tasks:

Energy: How much energy did the neutrino carry? (Like guessing the speed of a car).
Direction: Where did it come from? (Like pointing to the source of a sound).
Shape (Track vs. Cascade): Did it leave a long trail (like a bullet) or a short puff (like a firework)?
Location: Exactly where did the crash happen?
Inelasticity: A fancy physics term for what fraction of the neutrino's energy went into the debris of the crash (the hadronic shower) versus being carried away by the outgoing particle.

The Results: What Did We Learn?

The paper found some fascinating things, which can be summarized with a few metaphors:

Density Matters: If you want to know exactly where a crash happened (Vertex) or how the energy was split (Inelasticity), you need a dense detector (lots of sensors close together). It's like photographing a small object: the more pixels packed into the frame, the finer the detail you can make out. The AI did much better on the "dense" detector simulations.
Volume Matters: If you want to know the direction of a high-speed neutrino, a large detector is better. Even if the sensors are far apart, a huge volume gives the AI enough context to see the "line" the particle traveled.
The "Global" View Wins: For figuring out direction, the AI that looked at the whole picture at once (DeepIce and GRIT) beat the ones that looked at clues one by one. It's like the difference between trying to solve a maze by looking at one square at a time versus seeing the whole map.
No Single Winner: There wasn't one "perfect" AI for everything. Sometimes the old-school detective (DynEdge) was best; sometimes the new global thinker (DeepIce) won. It depends on the specific puzzle and how much energy the neutrino had.

Why This Matters

This paper is a huge step forward because it breaks down the walls between different scientific teams. By providing a common benchmark, anyone in the world can now download the data, train their own AI, and say, "My new method is 5% better than the current standard."

It turns neutrino physics from a collection of isolated experiments into a collaborative global sport, where the goal is to build the smartest, most accurate "ghost-hunting" AI possible. The datasets and the winning models are all open-source, meaning the door is wide open for the next generation of scientists to push the boundaries of what we know about the universe.

1. Problem Statement

Neutrino telescopes (e.g., IceCube, KM3NeT, Baikal-GVD) detect Cherenkov radiation from neutrino interactions to infer properties of the incident neutrino (energy, direction, interaction vertex, etc.). This process, known as event reconstruction, involves solving complex inverse problems.

Current Challenge: While deep learning (DL) has revolutionized reconstruction, cross-experimental collaboration is hindered by a lack of diverse, open-source datasets. Existing benchmarks are often specific to a single detector (e.g., IceCube), contaminated by atmospheric muons, or limited to specific tasks (e.g., direction only).
Need: A unified, open benchmark is required to compare reconstruction algorithms across different detector geometries (water vs. ice, varying densities) and multiple reconstruction tasks to accelerate the development of next-generation neutrino physics.

2. Methodology: The NuBench Framework

The authors introduce NuBench, a comprehensive open benchmark comprising simulated datasets and a standardized evaluation framework.

A. Dataset Generation

Scale: Nearly 130 million simulated neutrino events.
Physics: Includes both Charged-Current (CC) and Neutral-Current (NC) muon-neutrino interactions ( $\nu_\mu^{CC}$ and $\nu_\mu^{NC}$ ) with energies ranging from 10 GeV to 100 TeV.
Detector Geometries: Simulated across six distinct geometries inspired by existing and proposed telescopes:
1. Flower S, L, XL: Sunflower-style layouts (inspired by KM3NeT-ORCA, ARCA, and TRIDENT) with varying string densities.
2. Triangle & Cluster: Small clusters of strings (inspired by P-ONE and Baikal-GVD).
3. Hexagon: A hexagonal array with an inner "infill" (inspired by IceCube and DeepCore).
Medium: The six geometries yield seven datasets; six are simulated in water and one (Hexagon Ice LE) in ice.
Simulation Tool: Uses PROMETHEUS, an open-source simulation tool. The pipeline simulates neutrino interactions, traces Cherenkov photons, and applies a simplified detector response (merging photons into pulses, adding noise, and applying trigger conditions) to generate "pulse-level" data.
Data Structure:
- Pulse-level: Input features include DOM position, arrival time, charge, and signal fraction.
- Event-level (Ground Truth): Includes true neutrino energy, direction, interaction vertex, inelasticity, and event topology (Track vs. Cascade).
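The two data levels and the "merging photons into pulses" step described above can be sketched roughly as follows. This is an illustration only, not the PROMETHEUS or NuBench implementation: the class and field names, the 10 ns merging window, and the use of a photon count as the pulse charge are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PhotonHit:
    """One simulated photon arriving at a sensor (hypothetical record)."""
    dom_id: int
    time_ns: float
    is_signal: bool  # False for a noise photon


@dataclass(frozen=True)
class Pulse:
    """Pulse-level input features listed in the summary."""
    dom_xyz: tuple
    time_ns: float
    charge: float           # assumed here: number of merged photons
    signal_fraction: float  # share of merged photons that are signal


@dataclass(frozen=True)
class EventTruth:
    """Event-level ground truth listed in the summary."""
    energy_gev: float
    zenith: float
    azimuth: float
    vertex_xyz: tuple
    inelasticity: float
    is_track: bool


def _make_pulse(group, dom_xyz):
    n_signal = sum(hit.is_signal for hit in group)
    return Pulse(dom_xyz, group[0].time_ns, float(len(group)), n_signal / len(group))


def merge_hits_into_pulses(hits, dom_positions, window_ns=10.0):
    """Group hits on the same DOM that fall within window_ns of the
    first hit in the group; window_ns is an assumed value."""
    by_dom = defaultdict(list)
    for hit in hits:
        by_dom[hit.dom_id].append(hit)

    pulses = []
    for dom_id, dom_hits in by_dom.items():
        dom_hits.sort(key=lambda h: h.time_ns)
        group = [dom_hits[0]]
        for hit in dom_hits[1:]:
            if hit.time_ns - group[0].time_ns <= window_ns:
                group.append(hit)
            else:
                pulses.append(_make_pulse(group, dom_positions[dom_id]))
                group = [hit]
        pulses.append(_make_pulse(group, dom_positions[dom_id]))
    return sorted(pulses, key=lambda p: p.time_ns)


hits = [
    PhotonHit(0, 100.0, True), PhotonHit(0, 104.0, True),
    PhotonHit(0, 107.0, False), PhotonHit(0, 250.0, True),
    PhotonHit(1, 130.0, True),
]
dom_positions = {0: (0.0, 0.0, -100.0), 1: (0.0, 0.0, -120.0)}
pulses = merge_hits_into_pulses(hits, dom_positions)
```

In this toy event the three early hits on DOM 0 collapse into a single pulse with a signal fraction of two thirds, while the late hit and the hit on DOM 1 each become their own pulse. A model would receive the list of `Pulse` records as input and be trained against the matching `EventTruth`.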

B. Reconstruction Algorithms Evaluated

The paper evaluates four state-of-the-art deep learning architectures, all implemented in the open-source library GraphNeT:

ParticleNet: A Graph Neural Network (GNN) currently used in KM3NeT.
DynEdge: A GNN currently used in IceCube.
GRIT: A new hybrid algorithm combining GNNs with Transformer-style attention mechanisms.
DeepIce: A Transformer-based model (winner of the "IceCube – Neutrinos in Deep Ice" Kaggle challenge).

C. Evaluation Tasks

The models are tested on five core reconstruction tasks:

Energy Reconstruction: Estimating incident neutrino energy.
Direction Reconstruction: Estimating the arrival direction (zenith/azimuth).
Topology Classification: Distinguishing between Tracks (muons) and Cascades (electrons/hadrons).
Interaction Vertex Prediction: Locating the 3D point of neutrino interaction.
Inelasticity Estimation: Determining the fraction of energy transferred to hadronic products (relevant for CC interactions).
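Two of these quantities have compact definitions that are easy to state in code. The sketch below computes the opening angle between a true and a predicted direction, and the inelasticity as the hadronic energy fraction. The function names are hypothetical, and the mapping from zenith/azimuth to a unit vector is the usual spherical-coordinate convention, assumed here rather than taken from the paper.

```python
import numpy as np


def direction_vector(zenith, azimuth):
    """Unit vector for a (zenith, azimuth) pair in radians.

    Assumed convention: standard spherical coordinates with the
    zenith angle measured from the +z axis.
    """
    return np.stack([
        np.sin(zenith) * np.cos(azimuth),
        np.sin(zenith) * np.sin(azimuth),
        np.cos(zenith),
    ], axis=-1)


def opening_angle_deg(zen_true, azi_true, zen_pred, azi_pred):
    """Angle in degrees between the true and the predicted direction."""
    a = direction_vector(np.asarray(zen_true), np.asarray(azi_true))
    b = direction_vector(np.asarray(zen_pred), np.asarray(azi_pred))
    cos_angle = np.clip(np.sum(a * b, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))


def inelasticity(e_neutrino, e_hadronic):
    """Fraction of the neutrino energy transferred to hadronic products."""
    return np.asarray(e_hadronic) / np.asarray(e_neutrino)


# Three toy events: exact, 90 degrees off, and opposite direction.
zen_true = np.array([0.5, 0.0, 0.0])
azi_true = np.array([1.0, 0.0, 0.0])
zen_pred = np.array([0.5, np.pi / 2, np.pi])
azi_pred = np.array([1.0, 0.0, 0.0])

angles = opening_angle_deg(zen_true, azi_true, zen_pred, azi_pred)
median_angle = np.median(angles)
```

The median of the per-event opening angles is the kind of summary number reported for direction reconstruction. For a charged-current event with a 100 GeV neutrino that deposits 30 GeV in the hadronic shower, `inelasticity(100.0, 30.0)` gives 0.3.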

3. Key Contributions

First Multi-Geometry Benchmark: NuBench is the first open resource providing diverse datasets across multiple detector layouts and media (water/ice), enabling the study of algorithm generalizability.
Comprehensive Task Coverage: Unlike previous challenges focused on single tasks, NuBench covers the full spectrum of reconstruction needs (energy, direction, vertex, topology, inelasticity).
Standardized Baselines: The paper provides rigorous, reproducible baselines using four distinct architectures trained on the same data, allowing for fair comparison of GNNs vs. Transformers vs. Hybrid models.
Open Science: All datasets, model artifacts, and predictions are publicly available to foster community-wide development.

4. Key Results & Findings

A. General Trends

Detector Density vs. Task:
- High-Density Detectors (e.g., Flower S): Excel at tasks requiring fine spatial resolution, such as Vertex and Inelasticity reconstruction.
- Large-Volume/Low-Density Detectors (e.g., Flower XL): Perform better at Direction reconstruction for high-energy tracks due to the extended path length of muons.
Morphology Impact: Reconstruction performance varies significantly between CC (Track-like) and NC (Cascade-like) events. NC events generally show higher variance in energy and direction reconstruction due to the lack of a long track.

B. Algorithm Performance

Direction Reconstruction:
- DeepIce (Transformer) outperformed the other models on nearly all datasets, achieving the lowest median opening angles.
- GRIT (Hybrid) was a close second.
- Insight: Global attention mechanisms (dot-product attention) appear superior to localized graph convolutions for capturing the global topology required for direction estimation.
Vertex Reconstruction:
- DynEdge was the clear winner, with errors often about a factor of 2 lower than those of ParticleNet and GRIT.
- Insight: Despite architectural similarities between DynEdge and ParticleNet, specific architectural choices in DynEdge led to superior optimization for spatial localization.
Energy Reconstruction:
- Performance was highly correlated between ParticleNet, DynEdge, and GRIT. No single architecture dominated; results alternated based on the dataset and energy range. Global attention provided limited benefit here compared to direction tasks.
Topology Classification (Track/Cascade):
- GRIT achieved the highest Area Under the Curve (AUC) on datasets with high class imbalance (e.g., Flower XL), likely because it was trained on the full dataset rather than a balanced subsample.
- Performance gaps between models were generally smaller for classification than for regression tasks.
Inelasticity:
- DynEdge generally performed best at low-to-intermediate energies.
- GRIT and ParticleNet showed competitive performance at high energies.
- All models exhibited multimodal prediction distributions, particularly at low energies where separating hadronic and leptonic components is difficult.
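For the topology classification results, it may help to see why AUC is a sensible metric when one class dominates. The sketch below computes the AUC from score ranks (the Mann-Whitney U formulation), a standard equivalent of integrating the ROC curve. It is an illustration with made-up labels and scores, not the evaluation code used in the paper, and the function name is hypothetical.

```python
import numpy as np


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for the positive class (e.g. track), 0 for the negative class.
    scores: classifier outputs, higher meaning "more likely positive".
    Tied scores receive their average rank.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)

    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()

    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Imbalanced toy sample: four tracks, one cascade.
labels = [1, 1, 1, 1, 0]
informative = roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.5])
always_track = roc_auc(labels, [1.0, 1.0, 1.0, 1.0, 1.0])
```

A classifier that labels every event as a track would be right 80% of the time on this sample, yet its AUC is 0.5, the same as random guessing. The informative scores reach 0.75 because three of the four track/cascade pairs are ranked correctly, which is why AUC is less misleading than accuracy on class-imbalanced datasets.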

5. Significance and Conclusion

Validation of Deep Learning: The study confirms that deep learning architectures trained on one geometry generalize well to others, validating the potential for cross-experimental collaboration.
Architecture Selection: There is no "one-size-fits-all" model.
- Use Transformers/Attention (DeepIce/GRIT) for Direction.
- Use Optimized GNNs (DynEdge) for Vertex and potentially Inelasticity.
- Use GNNs (ParticleNet/DynEdge) for Energy.
Future Impact: NuBench provides a critical infrastructure for the neutrino community to develop the next generation of reconstruction algorithms for upcoming detectors (e.g., IceCube-Gen2, KM3NeT-ARCA). It shifts the paradigm from isolated, experiment-specific development to a unified, open-science approach.

The paper concludes that while likelihood-based methods remain important, deep learning offers competitive accuracy at speeds orders of magnitude faster, and open benchmarks like NuBench are essential for accelerating progress in neutrino astronomy.
