Transformer-Based Pulse Shape Discrimination in HPGe Detectors with Masked Autoencoder Pre-training

This paper demonstrates that transformer-based models, especially when enhanced by masked autoencoder pre-training, outperform traditional gradient-boosted decision trees at pulse-shape discrimination and energy regression in HPGe detectors. By learning directly from full waveform data, they also significantly reduce the number of labeled training samples required.

Marta Babicz, Saúl Alonso-Monsalve, Alain Fauquex, Laura Baudis

Published Mon, 09 Ma

Imagine you are a detective trying to solve a mystery in a very quiet, dark room. The "mystery" is a rare event called neutrinoless double-beta decay, a process that, if found, would rewrite our understanding of the universe. But the room is full of "noise"—background radiation from rocks, cosmic rays, and other particles that look almost exactly like the signal you are looking for.

To find the real signal, you need to listen very carefully to the "footsteps" (electrical signals) left by particles as they hit a giant, super-cold germanium crystal detector.

This paper is about teaching a computer to listen to these footsteps better than any human or traditional method ever could. Here is the breakdown in simple terms:

1. The Problem: The "Summary Sheet" vs. The "Full Recording"

Traditionally, when physicists looked at these electrical signals (waveforms), they were like a musician listening to a symphony and only writing down three numbers: "How loud was the first note? How long did it last? How quiet was the end?"

They threw away the rest of the music. They thought, "These three numbers are enough to tell if it's a rock rolling by (background) or a ghost walking by (the signal)."

The Paper's Idea: Why throw away the rest of the recording? What if we let the computer listen to the entire song, from the first note to the last, to hear the subtle differences we missed?
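To make the "summary sheet" concrete, here is a minimal NumPy sketch of reducing a full pulse to three numbers on a synthetic waveform. The feature names are illustrative stand-ins for this analogy, not the actual parameters used in HPGe analyses:

```python
import numpy as np

def summary_features(waveform, dt=1.0):
    """Compress a full pulse into three hand-crafted numbers
    (illustrative stand-ins, not real PSD parameters)."""
    amp = waveform.max()                        # "how loud was the first note"
    i10 = np.argmax(waveform >= 0.1 * amp)      # first sample above 10% of peak
    i90 = np.argmax(waveform >= 0.9 * amp)      # first sample above 90% of peak
    rise_time = (i90 - i10) * dt                # "how long did it last"
    tail_mean = waveform[-50:].mean()           # "how quiet was the end"
    return amp, rise_time, tail_mean

# Synthetic charge pulse: linear rise from t=100 to t=120, then slow decay.
t = np.arange(500)
pulse = np.clip((t - 100) / 20.0, 0, 1) * np.exp(-np.maximum(t - 120, 0) / 400.0)

print(summary_features(pulse))
```

Everything the waveform contains beyond these three numbers is simply discarded, which is exactly the information the paper's approach keeps.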

2. The New Tool: The "Transformer"

The authors used a type of AI called a Transformer. You might know Transformers from chatbots that write essays or translate languages. They are amazing at understanding context and long sequences.

  • The Analogy: Imagine a traditional method is like a security guard who only checks your ID badge (a few summary numbers). The Transformer is like a detective who watches your entire walk through the building: your gait, your speed, how you look around, and your posture. It sees the whole picture.

In this paper, the Transformer looked directly at the raw electrical waves from the detector, without compressing them into summary numbers first.
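Feeding a raw waveform to a transformer typically means chopping it into fixed-length "tokens" first. The sketch below shows only that tokenization step, with an assumed patch length of 16 samples (the paper's actual patch size and model are not shown here):

```python
import numpy as np

def patchify(waveform, patch_len=16):
    # Split the raw trace into fixed-length chunks ("tokens").
    # A transformer attends across all chunks at once, so no
    # information is thrown away up front.
    n = len(waveform) // patch_len * patch_len  # drop any ragged tail
    return waveform[:n].reshape(-1, patch_len)

wave = np.random.default_rng(0).normal(size=512)  # stand-in for a detector trace
tokens = patchify(wave)
print(tokens.shape)  # (32, 16): 32 tokens of 16 samples each
```

Each token then gets a positional encoding so the model knows where in the pulse it came from, just as the detective knows the order of your steps through the building.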

3. The Secret Sauce: "Masked Autoencoders" (The "Fill-in-the-Blanks" Game)

Training a super-smart AI usually requires millions of labeled examples (e.g., "This wave is a ghost," "That wave is a rock"). But in physics, labeling data is hard, expensive, and slow. You have to be an expert to say, "Yes, that's definitely a background event."

The authors used a clever trick called Masked Autoencoder (MAE) pre-training.

  • The Analogy: Imagine you have a library of 1 million books, but only 10,000 have their endings written down (labeled).
    • Old Way: You try to learn the story using only the 10,000 books with endings.
    • The Paper's Way: You take the 1 million books, rip out random pages (masking them), and ask the AI to guess what the missing pages say based on the rest of the story.
    • The Result: The AI becomes an expert at understanding the structure of the language (the physics of the detector) just by reading the unlabeled books. Once it is an expert at "filling in the blanks," you only need a few labeled books to teach it the specific mystery you are solving.
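The "rip out random pages" step of MAE pre-training can be sketched in a few lines. This is a NumPy sketch of the data preparation only: the mask ratio of 0.75 is an assumption borrowed from common MAE practice, and the encoder, decoder, and reconstruction loss are omitted:

```python
import numpy as np

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Hide a random subset of waveform patches; during pre-training
    the model must reconstruct the hidden ones from the visible rest."""
    rng = np.random.default_rng(seed)
    n = len(patches)
    n_mask = int(round(mask_ratio * n))
    hide = rng.choice(n, size=n_mask, replace=False)  # which "pages" to rip out
    visible = np.ones(n, dtype=bool)
    visible[hide] = False
    return patches[visible], patches[~visible], visible

patches = np.arange(32 * 16, dtype=float).reshape(32, 16)  # 32 tokens of 16 samples
vis, hidden, keep = mask_patches(patches)
print(vis.shape, hidden.shape)  # (8, 16) visible, (24, 16) to reconstruct
```

No labels appear anywhere in this game: the reconstruction target is the waveform itself, which is why millions of unlabeled pulses can be used.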

The Benefit: This pre-training made the AI roughly 2 to 4 times more label-efficient: it matched the performance of the old methods with only a quarter to a half of the labeled examples.

4. The Results: Who Won the Race?

The authors compared three racers:

  1. The Old Guard (GBDT): A classic machine-learning model, a gradient-boosted decision tree, that uses the "summary sheet" (hand-crafted numbers).
  2. The Newbie (Transformer from Scratch): The new AI, but it had to learn everything from zero using only the few labeled examples.
  3. The Pro (Transformer with Pre-training): The new AI that played the "fill-in-the-blanks" game first.

The Winner: The Pro (Transformer with Pre-training).

  • It beat the Old Guard in every category.
  • It was especially good at spotting the "tricky" background events that usually fool the Old Guard.
  • It was also better at measuring the energy of the event (like guessing the weight of a package just by looking at it).

5. Why Does This Matter?

In the search for neutrinoless double-beta decay, every bit of background noise you can remove brings you closer to finding the "Holy Grail" of physics.

  • Better Signal: By using the full waveform, the AI can reject more background noise without accidentally throwing away the real signal.
  • Faster Science: Because the AI learns faster (thanks to the "fill-in-the-blanks" trick), scientists don't have to wait years to collect enough labeled data to train a new model. They can adapt quickly to new detectors or new experimental setups.

Summary

Think of this paper as upgrading a security system. Instead of just checking a person's ID badge (the old way), the new system watches their entire body language and movement history (the Transformer). And to make sure the system is smart enough to do this without needing a million human trainers, it first plays a game of "guess the missing puzzle piece" using millions of unlabeled photos (Masked Autoencoding).

The result? A smarter, faster, and more accurate detector that brings us one step closer to solving one of the universe's biggest mysteries.