Original authors: Joschka Birk, Frank Gaede, Anna Hallin, Gregor Kasieczka, Martina Mozzanica, Henning Rose

Published 2026-06-11


Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a computer to recreate the complex, messy "shower" of particles that happens when a high-energy photon hits a detector in a particle physics experiment. This isn't just a simple picture; it's a 3D cloud of thousands of tiny energy deposits, each with a specific location and amount of energy.

The paper introduces a new AI method called SPADE (Split-and-Delay Embeddings) to do this job faster and more accurately than previous methods. Here is how it works, explained through everyday analogies.

The Problem: The "All-in-One" Dictionary

Previous AI models tried to describe every single particle hit by turning its location ($x, y, z$) and energy ($E$) into one giant, unique ID number, like a library book code.

The Analogy: Imagine you are describing a house. Instead of saying "3 bedrooms, 2 bathrooms, 2000 sq ft," you assign the house a single, massive code like "74,829,102."
The Issue: If you want to describe houses with more detail (higher resolution), the number of possible codes explodes. To handle a high-resolution detector, the AI needs a dictionary with millions of codes. This makes the AI huge, slow to train, and prone to forgetting details because the dictionary is so sparse. It's like trying to learn a language where every sentence requires a unique, never-before-seen word.

The Solution: SPADE's "Split and Delay" Strategy

SPADE changes the rules. Instead of treating the location and energy as one giant code, it breaks them apart and feeds them to the AI one by one, with a specific timing trick.

1. Split: Breaking the House into Rooms

Instead of one giant code for the whole house, SPADE describes the house by listing its features separately:

"It's on the 3rd floor."
"It's in the 5th row."
"It's in the 10th column."
"It has 500 units of energy."

The Benefit: The AI doesn't need a dictionary of millions of codes. It just needs three small dictionaries (one for rows, one for columns, one for floors), while the energy is kept as a continuous number rather than looked up in a dictionary. This is like learning to spell words letter-by-letter instead of memorizing a dictionary of every possible sentence. It makes the AI much smaller and easier to train.

2. Delay: The "Wait a Beat" Trick

If the AI just lists the features separately ("Floor 3... Row 5... Column 10... Energy 500"), it might forget that they all belong to the same hit. It might accidentally mix up the energy of one hit with the location of another.

The Analogy: Imagine a conductor leading an orchestra. If everyone plays their part at the exact same time, it's chaos. But if the conductor says, "Violins, play now. Cellos, wait one beat. Flutes, wait two beats," the musicians can hear what the others played just before them and adjust their own playing to fit perfectly.

SPADE does this by delaying the information.

It tells the AI: "Here is the Z-coordinate."
Wait a beat.
"Here is the X-coordinate (now you know the Z, so you can relate to it)."
Wait a beat.
"Here is the Y-coordinate (now you know X and Z)."
Wait a beat.
"Here is the Energy (now you know the exact location, so you can match the energy to the spot)."

By the time the AI predicts the energy, it has already "seen" the location. This allows the AI to learn the crucial relationship between where a hit is and how much energy it has, without needing to cram them into a single code.

The Results: Why It Matters

The authors tested SPADE against two other methods:

The Old Way (OmniJet-αC): Used the giant "all-in-one" code. It was slow and lost detail.
The "Combined" Way: Tried to list features separately but without the clever "delay" trick. It was better but still struggled to scale up.
SPADE: Used the Split-and-Delay method.

The Findings:

Accuracy: SPADE recreated the particle showers more accurately than the old methods, matching the "gold standard" physics simulations (Geant4) very closely.
Efficiency: Because it didn't need a massive dictionary, SPADE was 6.9 times faster to train and required 74 times fewer parameters (memory) than the "Combined" method when dealing with high-resolution data.
Scalability: As the detector gets more detailed (higher granularity), the dictionaries of the old methods grow with the cube of the resolution, making those models far slower and heavier. SPADE's dictionaries grow only linearly, so it stays light and fast.

The Bottom Line

SPADE is like teaching an AI to paint a complex 3D picture not by memorizing every possible finished painting, but by teaching it to place individual dots of color one by one, ensuring each dot knows exactly where the previous dots were placed. This allows it to handle incredibly detailed images (simulations) without needing a supercomputer to store the instructions.

The paper concludes that this "Split-and-Delay" technique isn't just for particle physics; it could be a new way to handle any complex data where multiple features (like location, time, and intensity) need to be generated together, potentially helping fields like astronomy or any area dealing with high-dimensional sensor data.

Technical Summary: SPADE – Split-and-Delay Embeddings for Autoregressive High-Granularity Calorimeter Simulation

Problem Statement

High-energy physics (HEP) experiments require vast amounts of Monte Carlo (MC) samples for detector simulation. Traditional tools like Geant4 provide high-fidelity results but are computationally prohibitive, particularly for highly granular calorimeters where the demand for resources is expected to outstrip availability. While generative machine learning (ML) models (GANs, VAEs, diffusion models) offer alternatives, recent foundation models based on autoregressive transformers (e.g., OmniJet-α) face specific challenges when applied to calorimeter showers:

Inefficient Tokenization: Existing approaches often use Vector Quantized Variational Autoencoders (VQ-VAE) to convert continuous spatial and energy features into discrete tokens. This introduces information loss and creates a "bottleneck" where the vocabulary size scales cubically ($O(N^3)$) with detector granularity, leading to an explosion in model parameters and training costs.
Correlation Loss: Treating multi-feature tokens (spatial coordinates $x, y, z$ and energy $E$) as a single unit or predicting them independently without conditioning can fail to capture the crucial intra-token correlations necessary for realistic shower reconstruction.
Scalability: Current autoregressive models struggle to scale to the extreme granularities required by future collider detectors (e.g., the ILD) without becoming computationally intractable.

Methodology

The paper introduces SPADE (SPlit And Delay Embeddings), an autoregressive transformer architecture designed to handle sequences of tokens carrying multiple features without lossy compression.

Core Architectural Innovations

Split Embeddings (Factorization):
Unlike previous models that embed a 3D voxel index as a single token (scaling vocabulary as $N_x \cdot N_y \cdot N_z$), SPADE splits the four hit features ($x, y, z, E$) into independent prediction streams.
- Spatial coordinates are embedded independently into 64-dimensional vectors.
- The vocabulary size scales linearly ($V = N_x + N_y + N_z$) rather than multiplicatively.
- This eliminates the need for a VQ-VAE, preserving continuous information and avoiding the information loss inherent in vector quantization.
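The difference between the two vocabulary scalings can be made concrete with a small sketch. The function names and the cubic grid sizes below are hypothetical, chosen only for illustration; they are not the detector geometries used in the paper.

```python
def combined_vocab(nx: int, ny: int, nz: int) -> int:
    """One token per voxel: the vocabulary is the product of the axis sizes."""
    return nx * ny * nz


def split_vocab(nx: int, ny: int, nz: int) -> int:
    """One small vocabulary per axis: the total is the sum of the axis sizes."""
    return nx + ny + nz


# Hypothetical cubic grids (illustrative sizes only).
for n in (10, 20, 40):
    print(n, combined_vocab(n, n, n), split_vocab(n, n, n))
# 10 1000 30
# 20 8000 60
# 40 64000 120
```

Doubling the resolution along every axis multiplies the combined vocabulary by eight but only doubles the split one. Each vocabulary entry needs its own embedding row, which is why the parameter counts diverge so quickly.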
Delay Mechanism (Staggered Conditioning):
To prevent the loss of correlations between the split features (e.g., between position and energy), SPADE employs a progressive delay strategy along the sequence.
- Instead of generating a hit all at once, the model builds each hit sequentially.
- The input at sequence position $i$ contains components from different hits: $z_i$, $x_{i-1}$, $y_{i-2}$, and $E_{i-3}$.
- This allows the standard self-attention mechanism to learn intra-token correlations autoregressively. By the time the model predicts a specific feature (e.g., $E_i$), it has already seen the other features of that same hit ($z_i, x_i, y_i$) in previous steps, effectively conditioning the prediction on the full context of the current hit.
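The staggered layout described above can be sketched with NumPy as follows. This is a minimal illustration of the indexing only, not the authors' implementation: the `PAD` value, the function names, and the choice to append three flush steps at the end are all assumptions, since the summary does not say how the model handles the start and end of a sequence.

```python
import numpy as np

PAD = -1.0  # hypothetical filler where a delayed stream has no value yet


def delay_streams(z, x, y, e):
    """Stagger four per-hit streams so that row i holds
    z_i, x_{i-1}, y_{i-2}, E_{i-3} (PAD where that hit does not exist)."""
    n = len(z)
    out = np.full((n + 3, 4), PAD)  # three extra rows flush the delayed streams
    for k, stream in enumerate((z, x, y, e)):
        out[k:k + n, k] = stream    # stream k is shifted down by k rows
    return out


def undelay_streams(delayed):
    """Invert delay_streams: recover one (z, x, y, E) row per hit."""
    n = delayed.shape[0] - 3
    return np.stack([delayed[k:k + n, k] for k in range(4)], axis=1)


delayed = delay_streams([1, 2, 3], [10, 20, 30], [100, 200, 300], [0.5, 0.6, 0.7])
print(delayed[3])  # row 3: no z_3, then x_2 = 30, y_1 = 200, E_0 = 0.5
```

With a causal attention mask, the step that emits $E_i$ can only attend to rows that already contain $z_i$, $x_i$, and $y_i$, which is the conditioning the delay is meant to provide.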
Model Components:
- Energy Head: Uses a Mixture-of-Gaussians (MoG) head to predict continuous energy, conditioned on the spatial coordinates via the delay mechanism.
- Stop Head: A dedicated binary classifier (independent of the backbone output) determines sequence termination, addressing issues with stop-token entanglement found in prior models.
- Backbone: A decoder-only transformer using Rotary Position Embedding (RoPE), Multi-Query Attention, and FlashAttention for efficiency.
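To illustrate what a Mixture-of-Gaussians head computes, the following sketch evaluates the negative log-likelihood of an energy value under a one-dimensional Gaussian mixture and draws a sample from it. The parameterization (mixture logits, means, and log standard deviations) and the function names are assumptions made for this example; the summary does not specify the number of components or how the paper parameterizes the head.

```python
import numpy as np


def mog_nll(energy, logits, means, log_stds):
    """Negative log-likelihood of a scalar under a 1D Gaussian mixture."""
    log_w = logits - np.logaddexp.reduce(logits)  # log-softmax over components
    z = (energy - means) / np.exp(log_stds)
    log_comp = -0.5 * z ** 2 - log_stds - 0.5 * np.log(2.0 * np.pi)
    return -np.logaddexp.reduce(log_w + log_comp)  # log-sum-exp over components


def mog_sample(rng, logits, means, log_stds):
    """Pick a component by its mixture weight, then sample from that Gaussian."""
    weights = np.exp(logits - np.logaddexp.reduce(logits))
    k = rng.choice(len(weights), p=weights)
    return rng.normal(means[k], np.exp(log_stds[k]))


# Hypothetical head outputs for a single hit (values chosen for illustration).
logits = np.array([0.0, 1.0])
means = np.array([0.2, 1.5])
log_stds = np.array([-1.0, -0.5])

rng = np.random.default_rng(0)
print(mog_nll(0.3, logits, means, log_stds))
print(mog_sample(rng, logits, means, log_stds))
```

In training, a loss of this form would be minimized on the true energy; at generation time, the sampling step replaces the dictionary lookup used for discrete tokens.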

Baselines and Comparisons

The authors compare SPADE against:

OmniJet-αC: The predecessor using VQ-VAE tokenization.
Combined: A baseline that removes VQ-VAE but uses a single combined spatial vocabulary ($N_x \cdot N_y \cdot N_z$) with a single delay for energy.
AllShowers: A state-of-the-art flow-matching reference model.

Key Contributions

Scalable Architecture: SPADE demonstrates that autoregressive models can scale to high detector granularities by reducing parameter counts from cubic to linear scaling relative to grid resolution. At $\times 16$ granularity, SPADE uses a factor of 74 fewer parameters than the Combined baseline.
Lossless Feature Handling: By eliminating the VQ-VAE, SPADE avoids the spatial and energetic artifacts associated with lossy compression, enabling direct use of discrete grid coordinates and continuous energy values.
Correlation Preservation: The delay mechanism successfully recovers the energy-position correlations that are often lost when features are predicted independently or jointly without sequential conditioning.
Training Efficiency: SPADE converges faster and to lower validation losses than the Combined model, requiring significantly fewer GPU hours (e.g., 25.8 vs. 178.7 hours at $\times 16$ granularity).
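
As a quick check, the GPU-hour figures quoted for Training Efficiency imply the following training speed-up at this granularity:

$$
\frac{178.7\ \text{h}}{25.8\ \text{h}} \approx 6.9
$$

This agrees with the 6.9-fold training speed-up quoted in the findings of the lay summary.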

Results

The models were evaluated on two photon shower datasets derived from Geant4 simulations of the ILD detector: GettingHigh (irregular grid) and GettingSquare (regular grid with varying granularities).

Performance on GettingHigh: SPADE is competitive with the state-of-the-art AllShowers model on most observables and substantially outperforms OmniJet-αC. It achieves the best agreement on the ratio of deposited to incident energy and the center of gravity, validating the efficacy of the staggered conditioning scheme.
Performance on GettingSquare:
- SPADE outperforms the Combined baseline on observables probing spatial structure (e.g., center of gravity), where the Combined model suffers from token sparsity in the large vocabulary.
- SPADE scales linearly with granularity, whereas the Combined model's parameter count and training cost increase prohibitively.
- While AllShowers (non-autoregressive) remains the fastest generator, SPADE generates showers roughly twice as fast as the Combined model and achieves comparable or superior physics fidelity.
Failure Modes: A specific failure mode where SPADE occasionally halts generation prematurely (under-predicting energy) affects ~0.35% of showers. The authors implement a post-processing filter to reject these outliers, ensuring physics results are reported on valid samples.

Significance and Claims

The paper posits that SPADE represents a significant step forward in applying foundation model paradigms to high-dimensional, multi-feature physics data.

Beyond Tokenization: It challenges the necessity of lossy tokenization (VQ-VAE) for numerical data, demonstrating that splitting features and using delay-based conditioning is a more effective strategy for autoregressive generation.
Practicality for Future Detectors: By solving the parameter scaling problem, SPADE makes autoregressive transformers a viable architecture for the highly granular calorimeters of future collider experiments, where current methods are computationally prohibitive.
General Applicability: The authors claim the split-and-delay mechanism is applicable to any generative task involving tokens with multiple features (discrete or continuous), potentially enabling LLM-style pretraining workflows for higher-dimensional data in HEP and other fields (e.g., astrophysics).

The work concludes that while autoregressive generation is inherently slower than flow-based methods, the improvements in representational efficiency and physics fidelity over single-stream combined tokenization models make SPADE a critical building block for future foundation models in scientific domains.

SPADE: Split-and-Delay Embeddings for Autoregressive High-Granularity Calorimeter Simulation