Adapting Vision-Language Models for Neutrino Event… — Plain-Language Explanation

Original authors: Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

Published 2026-05-11

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery inside a giant, high-tech camera. This camera doesn't take photos of people or landscapes; it takes pictures of invisible particles zipping through a tank of liquid argon. When these particles crash into the atoms in the tank, they leave behind faint, pixelated trails—like footprints in the snow.

The goal of this research is to teach a computer to look at these "snow footprints" and quickly say which kind of neutrino interaction made them: "Ah, this one produced a muon (a heavy particle that leaves a long, narrow trail)," or "This one produced an electron (a fuzzy, spreading shower)," or "This is a background event (a neutral-current interaction)."

Here is how the paper breaks down the solution, using simple analogies:

1. The Old Way: The Specialized Artisan (CNN)

For years, physicists used a specific type of AI called a Convolutional Neural Network (CNN). Think of it as a master artisan who has spent decades learning to recognize specific patterns. The artisan is very fast and efficient but knows only what was explicitly taught. Show them a slightly blurry photo or a strange angle and they might get confused. They are great at the job, but they can't explain why they made a decision; they just hand you a label and a confidence score.

2. The New Contender: The Vision-Only Scholar (ViT)

Then came Vision Transformers (ViT). Imagine a scholar who looks at the entire picture at once, rather than scanning it piece by piece. This scholar is better at connecting distant dots (like a long, winding track across the whole image). The paper found that this scholar is more robust than the artisan. Even if the photo is blurry or low-resolution, the scholar can still figure out what's happening.

3. The Star of the Show: The Vision-Language Model (VLM)

Finally, the researchers tried something new: a Vision-Language Model (VLM), specifically a fine-tuned version of LLaMA 3.2 Vision with 11 billion parameters. Think of this model not just as a detective, but as a detective who is also a physics professor.

- It sees the image: It looks at the pixelated footprints just like the other models.
- It speaks the language: It has been trained on massive amounts of text and images. It understands concepts like "muon track," "electron shower," and "neutral current."

The Magic Trick:
When you ask the VLM to classify an event, it doesn't just spit out a label. It can also write a short explanation of its reasoning.

Example: "I see a long, narrow line in the image. Based on my training, long lines usually mean a muon. Therefore, this is a Muon event."

What Did They Find?

The researchers tested these three "detectives" on a massive dataset of simulated particle collisions. Here is the verdict:

Accuracy: The VLM (the Professor) and the ViT (the Scholar) were the winners. They were clearly more accurate than the CNN (the Artisan), at about 0.87 and 0.86 versus 0.80, and much better at handling blurry or low-resolution images.
The "Blind" Test: When the researchers tried to use the VLM without teaching it the specific rules of the game (just showing it a few examples), it failed, reaching only about 0.37 accuracy. It guessed the same answer for everything. This taught them that you must fine-tune (train) these big models specifically for physics; you can't just ask them to "guess" based on general knowledge.
The Trade-off: The VLM is the most accurate and most explainable, but it is also the slowest and most expensive to run. It needs far more computer memory and takes about 3.4 seconds to analyze one event, whereas the CNN does it in about 24 milliseconds.
- Analogy: The CNN is a sprinter who finishes the race in a flash but can't tell you the strategy. The VLM is a marathon runner who takes longer but can write a detailed book about the race strategy afterward.

Why Does This Matter?

The paper concludes that we don't have to choose just one. We can use them for different jobs:

Use the CNN when you need speed, like filtering data in real-time as it comes in from the detector.
Use the VLM for deep, offline analysis. When a physicist finds a weird event and wants to know why the computer flagged it, the VLM can provide a human-readable explanation that connects the pixels to physics concepts.

In short: This paper shows that giant, text-savvy AI models can be taught to "see" particle physics. They are slower than traditional tools, but they offer a powerful new ability: they can not only classify events but also explain their reasoning in plain English, bridging the gap between complex data and human understanding.

Technical Summary: Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Problem Statement
In high-energy physics (HEP), specifically within neutrino experiments like the Deep Underground Neutrino Experiment (DUNE), event classification is critical for distinguishing signal interactions (electron and muon neutrino charged-current events) from background (neutral current interactions). Traditionally, this task relies on reconstructing high-level objects and engineering specific features (e.g., energy, spatial configuration) to feed into algorithms ranging from decision trees to shallow neural networks. While effective, this approach is limited by reconstruction errors and the constraints of predefined features. Furthermore, deep learning models, particularly Convolutional Neural Networks (CNNs), often operate as "black boxes," lacking interpretability regarding why a specific prediction was made. Although Vision Transformers (ViTs) have improved performance by capturing long-range spatial dependencies, they still lack the ability to provide natural language reasoning or integrate semantic context.

Methodology
The authors propose adapting a Vision-Language Model (VLM), specifically a fine-tuned variant of LLaMA 3.2 Vision (11B parameters), to classify neutrino interactions directly from raw detector pixel maps.

Dataset: The study uses a custom simulation of a Liquid Argon Time Projection Chamber (LArTPC) with a 5 mm pixel resolution. The dataset comprises 190,000 simulated events ($\nu_e$ CC, $\nu_\mu$ CC, and Neutral Current) generated using GENIE and GEANT4. Each event is represented as a pair of 2D grayscale images (XZ and YZ projections) cropped to $512 \times 512$ pixels.
Model Architecture & Training:
- VLM (LLaMA 3.2 Vision): The model integrates a high-resolution ViT-h/14 vision encoder with a transformer-based language decoder. To adapt this 11B parameter model to the specific physics task without prohibitive computational costs, the authors employ QLoRA (Quantized Low-Rank Adaptation). This parameter-efficient fine-tuning (PEFT) method quantizes base weights to 4-bit precision and trains only low-rank adapter matrices (29.5M trainable parameters) over a single epoch. The training pipeline uses a physics-informed system prompt describing the detector geometry and interaction characteristics, followed by a user prompt requesting classification.
- Baselines: The VLM is benchmarked against two established architectures:
  1. A ViT-h/14 (632M parameters), representing the vision backbone of the VLM, trained via full fine-tuning for 10 epochs.
  2. A Siamese SE-ResNet CNN (21.7M parameters), representing the state-of-the-art convolutional approach used in major neutrino experiments, trained via full fine-tuning for 300 epochs.
Inference & Explainability: The VLM generates predictions autoregressively. To ensure machine-readable outputs, the authors apply phrasal constraints during decoding, forcing the model to output a fixed prefix followed by the class label. Crucially, the model is capable of generating natural language explanations justifying its classification based on visual features (e.g., "longer and narrower muon track" vs. "fuzzy electron shower").
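The low-rank adapter idea behind QLoRA can be illustrated with a small NumPy sketch. It is not the authors' training code: the layer sizes, rank, and `alpha` scaling below are hypothetical, and the 4-bit quantization of the frozen weights is left out for brevity. It only shows why the adapters need far fewer trainable parameters than the layer they modify.

```python
import numpy as np


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one adapter pair: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative sizes only)."""

    def __init__(self, d_in: int, d_out: int, rank: int, alpha: float = 16.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Frozen base weight. QLoRA would store this in 4-bit form; float32 here for simplicity.
        self.W = rng.standard_normal((d_out, d_in)).astype(np.float32)
        # Trainable adapters. B starts at zero, so the layer initially matches the base model.
        self.A = (0.01 * rng.standard_normal((rank, d_in))).astype(np.float32)
        self.B = np.zeros((d_out, rank), dtype=np.float32)
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Base output plus the scaled low-rank correction: x W^T + scale * (x A^T) B^T
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T


# Hypothetical 64 -> 32 layer with rank 4: 384 adapter parameters vs. 2048 frozen ones.
layer = LoRALinear(d_in=64, d_out=32, rank=4)
print(lora_param_count(64, 32, 4), layer.W.size)

# Figures quoted in the summary: 29.5M trainable parameters on a nominal 11B-parameter model,
# i.e. roughly 0.27% of the weights.
trainable_fraction = 29.5e6 / 11e9
print(f"{trainable_fraction:.2%}")
```

Because `B` is initialized to zero, the adapted layer starts out identical to the frozen one, and training only moves the small `A` and `B` matrices.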

Key Results

Classification Performance: The fine-tuned LLaMA 3.2 Vision achieved the highest accuracy, precision, and recall (0.87 each) with an AUC-ROC of 0.96. This performance was comparable to the fully fine-tuned ViT-h/14 (0.86 accuracy, 0.96 AUC) and significantly superior to the CNN baseline (0.80 accuracy, 0.94 AUC).
Parameter Efficiency: The VLM achieved these results by updating only 29.5M adapter parameters (via QLoRA) in a single epoch, whereas the ViT required updating all 632M parameters over 10 epochs, and the CNN all 21.7M parameters over 300 epochs.
Robustness (Generalization): Under a distribution shift involving downsampling the input images to 256 $\times$ 256 pixels (simulating degraded detector resolution), the transformer-based models (VLM and ViT) maintained high performance (0.85 accuracy). In contrast, the CNN baseline suffered a severe degradation, dropping to 0.43–0.49 accuracy.
Explainability: Unlike the CNN and ViT, which provide only numerical probabilities, the VLM generated human-readable explanations grounded in event topology. An ablation study showed that even without explicit physics definitions in the system prompt, the model maintained high accuracy (0.86) and generated plausible physics-related explanations, suggesting it internalized task-relevant features during fine-tuning.
Few-Shot Limitations: A few-shot in-context evaluation using the frozen pre-trained model (without fine-tuning) failed to distinguish between classes (accuracy ~0.37), demonstrating that parameter adaptation is necessary for this specific domain.
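The resolution shift used in the robustness test can be sketched as follows. The summary does not say which resampling method the authors used, so this example assumes simple 2×2 average pooling and applies it to a synthetic image with a single straight "track"; the function name `downsample_2x` is hypothetical.

```python
import numpy as np


def downsample_2x(image: np.ndarray) -> np.ndarray:
    """Halve each dimension by averaging non-overlapping 2x2 pixel blocks."""
    h, w = image.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split rows and columns into (block index, position inside block), then average per block.
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


# Synthetic 512 x 512 detector view with one vertical track (not real simulation data).
view = np.zeros((512, 512), dtype=np.float32)
view[100:400, 256] = 1.0

small = downsample_2x(view)
print(small.shape)  # (256, 256)
```

Average pooling keeps the mean pixel intensity but halves the peak value of a one-pixel-wide track, which gives a feel for how fine detail is lost at the lower resolution.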

Significance and Claims
The paper claims that Vision-Language Models represent a promising new direction for HEP event classification, offering a unique combination of strong predictive performance, robustness to detector variations, and enhanced interpretability.

The authors highlight that while VLMs incur higher computational costs (12.9 GB memory vs. 2.4 GB for the CNN; ~3.4 s inference vs. ~24 ms), their ability to provide physics-grounded textual justifications offers a distinct advantage for offline analysis, error diagnosis, and building trust in machine learning-driven scientific workflows. The results suggest that transformer-based architectures, particularly when adapted via parameter-efficient methods, can serve as general-purpose backbones for physics event classification. The study posits that this approach could pave the way for reusable "HEP foundation models" that generalize across different experiments with minimal further fine-tuning, bridging the gap between raw accuracy and the need for transparent, reasoning-based predictions in experimental physics.
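To put the quoted costs in perspective, the short calculation below turns them into ratios and rough throughput. It assumes the latencies are per event and that events are processed one at a time on a single device; batching or multiple devices would change the throughput numbers.

```python
# Cost figures quoted in the summary (approximate).
vlm = {"memory_gb": 12.9, "latency_s": 3.4}
cnn = {"memory_gb": 2.4, "latency_s": 0.024}

memory_ratio = vlm["memory_gb"] / cnn["memory_gb"]    # about 5.4x more memory
latency_ratio = vlm["latency_s"] / cnn["latency_s"]   # about 142x slower per event

# Sequential throughput under the one-event-at-a-time assumption.
vlm_events_per_hour = 3600 / vlm["latency_s"]  # about 1,059 events per hour
cnn_events_per_hour = 3600 / cnn["latency_s"]  # about 150,000 events per hour

print(f"memory: {memory_ratio:.1f}x, latency: {latency_ratio:.0f}x")
print(f"VLM: {vlm_events_per_hour:.0f} events/h, CNN: {cnn_events_per_hour:.0f} events/h")
```

On these numbers the CNN suits high-rate filtering, while the VLM is better kept for a smaller set of events selected for offline study.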
