Imagine you are trying to recreate a complex, three-dimensional explosion of energy inside a giant, high-tech camera called a calorimeter. When a particle hits this camera, it doesn't just make a single dot; it creates a "shower" of thousands of tiny energy deposits, like a glitter bomb exploding in slow motion.

Physicists need to simulate these explosions millions of times to understand the universe. The old way of doing this (using a program called Geant4) is like trying to paint every single grain of sand on a beach by hand. It's incredibly accurate, but it takes forever.

This paper introduces CaloArt, a new "AI artist" that can paint these energy explosions in a fraction of a second, without losing the scientific details. Here is how it works, explained simply:

1. The Problem: Too Many Pixels

Think of the energy shower as a giant 3D grid of pixels (called voxels).

Dataset 2 (CCD2): This is a medium-sized grid (about 6,500 pixels). It's like a small, detailed painting.
Dataset 3 (CCD3): This is a massive grid (about 40,500 pixels). It's like a huge, high-definition mural.

The problem is that standard AI models get overwhelmed when the grid gets too big. They try to look at every single pixel individually, which makes them slow and expensive to train.

2. The Solution: "Large Patches"

Instead of looking at every single pixel one by one, CaloArt looks at the image in chunks (or "patches").

Imagine you are reading a book. Instead of reading letter-by-letter (which is slow), you read word-by-word or phrase-by-phrase.
CaloArt reads the energy shower in big blocks. This drastically reduces the amount of work the computer has to do, making it much faster.

3. The Secret Sauce: "x-Prediction" vs. "v-Prediction"

To teach the AI to paint, you have to tell it what to guess. The paper compares two ways of teaching the AI:

The Old Way (v-prediction): The AI is asked to guess the direction and speed the paint needs to move to reach the final picture, rather than the picture itself. It's like answering, "Move the brush slightly up and to the right." This works well for small paintings (Dataset 2), but for huge murals (Dataset 3) painted in big blocks, the AI fails to learn useful instructions.
The New Way (x-prediction): Here, the AI is asked, "Just tell me what the final picture looks like." It guesses the final clean image directly.
- The Result: For the small painting (Dataset 2), the old way remained the better choice. But for the huge mural (Dataset 3), the new way (x-prediction) was a game-changer. It stayed trainable with very large blocks, where the old way failed to produce usable showers.

4. The Architecture: A Modernized Engine

The authors built a new engine for this AI called CaloArt. It's based on a modern design called a "Transformer" (the same type of brain behind many modern AI tools), but they upgraded it specifically for 3D energy showers:

3D Positioning: They gave the AI a built-in GPS so it knows exactly where in the 3D space each chunk of energy belongs.
Shared Brains: They made the AI more efficient by having different parts of the network share some of their "thinking" tools, saving memory without losing quality.

5. The Results: Fast and Accurate

The paper tested CaloArt against other top AI models and the traditional "hand-painting" method (Geant4).

On the Small Grid (Dataset 2): CaloArt produced the most accurate results among the transformer models it was compared with, and it generated showers faster than the non-distilled baselines (about 10 milliseconds per shower).
On the Big Grid (Dataset 3): This is where CaloArt shone. Because it used the "Large Patch" + "x-prediction" combo, it could generate these massive showers in about 11 milliseconds each (less than the blink of an eye) on a single computer chip.
- Other models were either several times slower (CaloDREAM++ needs about 96 milliseconds per shower) or produced lower-quality results (L2LFlows).
- CaloArt sits on the "Pareto frontier," which is a fancy way of saying that none of the compared models is both faster and more accurate. Among those models, anything more accurate is slower, and anything faster is less accurate.

Summary

CaloArt is a new, highly efficient AI that simulates the energy showers particles leave in a detector by looking at them in big chunks rather than tiny pixels. By using a specific teaching method called x-prediction, it successfully handles the massive, high-resolution data of modern particle detectors. It creates these simulations in milliseconds, making it a powerful tool for physicists who need to process huge amounts of data quickly, all without needing to compress the data first (which often loses important details).

The paper concludes that this approach is a practical, cost-effective way to simulate high-granularity particle showers, saving time and computing power while keeping the physics accurate.

Technical Summary: CaloArt

Problem Statement

High-granularity calorimeters are essential for collider physics but present a significant computational bottleneck for Monte Carlo simulations. Traditional Geant4-based simulations are too slow for the high-luminosity Large Hadron Collider (LHC) and future colliders, which require massive simulated event samples. While machine learning (ML) offers a path to fast simulation, high-granularity data creates a high-dimensional generative modeling problem.

Existing approaches face a trade-off between physics fidelity and computational cost:

Point cloud models handle sparsity well but are less directly tied to grid-based readout cells used in benchmarks.
Voxel-space models (e.g., U-Nets, Transformers) directly model per-cell energy deposits but suffer from rapidly increasing computational costs as voxel counts grow (e.g., from 6,480 voxels in CaloChallenge Dataset 2 to 40,500 in Dataset 3).
Latent-space models reduce dimensionality but require a high-fidelity tokenizer. Calorimeter showers lack a standard perceptual representation (analogous to VGG or DINOv2 for images), making it difficult to train a tokenizer that preserves necessary physics observables without introducing artifacts like blurring.

Consequently, there is a need for a method that performs direct raw voxel generation without a learned autoencoder tokenizer, while managing the computational cost of high-resolution grids.

Methodology

The paper proposes CaloArt, a modernized Diffusion Transformer (DiT) backbone designed for direct 3D voxel shower generation. The methodology rests on three pillars:

1. Large-Patch Tokenization with x-Prediction

To manage the computational cost of high-resolution grids (specifically for Dataset 3), CaloArt employs large 3D patch sizes to reduce the token sequence length.
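
The effect of patch size on sequence length can be illustrated with a short NumPy sketch. It assumes grid shapes of 45 × 16 × 9 (Dataset 2) and 45 × 50 × 18 (Dataset 3) in (layers, angular bins, radial bins) order, which multiply to the quoted voxel counts. The patch size (3, 5, 3) and the function name `patchify_3d` are hypothetical choices for illustration, not the paper's configuration.

```python
import numpy as np

def patchify_3d(shower, patch):
    """Split a (Z, A, R) voxel grid into non-overlapping 3D patches.

    Returns an array of shape (num_tokens, voxels_per_patch).
    """
    Z, A, R = shower.shape
    pz, pa, pr = patch
    assert Z % pz == 0 and A % pa == 0 and R % pr == 0, "patch must tile the grid"
    x = shower.reshape(Z // pz, pz, A // pa, pa, R // pr, pr)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # patch-grid indices first, in-patch indices last
    return x.reshape(-1, pz * pa * pr)

# Assumed grid shapes: (layers, angular bins, radial bins).
ccd2 = np.zeros((45, 16, 9))    # 6,480 voxels
ccd3 = np.zeros((45, 50, 18))   # 40,500 voxels

small = patchify_3d(ccd3, (1, 1, 1))  # one token per voxel
large = patchify_3d(ccd3, (3, 5, 3))  # hypothetical large patch
print(small.shape, large.shape)       # (40500, 1) (900, 45)

# Self-attention cost grows with the square of the token count.
attention_saving = (small.shape[0] / large.shape[0]) ** 2
print(attention_saving)               # 2025.0
```

Under these assumptions, each token carries 45 voxels, so the network has to predict a much higher-dimensional output per token. That is the regime in which the choice of prediction target becomes important.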

Prediction Target: The paper investigates the choice between predicting noise ($\epsilon$), flow velocity ($v$), or the clean sample ($x$).
x-Prediction Formulation: For high-dimensional, large-patch regimes (Dataset 3), the authors adopt x-prediction, where the network directly predicts the clean sample $x_\theta$.
Decoupled Spaces: The training objective uses Conditional Flow Matching (CFM). The prediction space ($x$) is decoupled from the loss space ($v$). The network outputs $x_\theta$, which is mapped to a velocity prediction $v_\theta = (x_\theta - z_t)/(1-t)$, where $z_t$ is the noisy sample at time $t$, and the loss is computed as the mean squared error between $v_\theta$ and the target velocity $v$. This reweighted $x$-loss allows the model to leverage the manifold assumption (that clean data lies on a low-dimensional manifold) while maintaining the stability of flow-based training.
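
To see why this is called a reweighted $x$-loss, the following sketch evaluates the velocity-space loss for a given clean-sample prediction. It assumes the linear interpolation $z_t = t\,x + (1-t)\,\epsilon$ with target velocity $v = x - \epsilon$, which is the convention consistent with the mapping $v_\theta = (x_\theta - z_t)/(1-t)$. The function name and the random stand-in for the network output are illustrative only.

```python
import numpy as np

def velocity_loss_from_x(x_pred, x, eps, t):
    """MSE in velocity space for a network that predicts the clean sample."""
    z_t = t * x + (1.0 - t) * eps          # assumed interpolation path
    v_target = x - eps                     # target velocity along that path
    v_pred = (x_pred - z_t) / (1.0 - t)    # map the x-prediction to a velocity
    return np.mean((v_pred - v_target) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # stand-in for clean showers
eps = rng.normal(size=(4, 8))                  # Gaussian noise
x_pred = x + 0.1 * rng.normal(size=(4, 8))     # stand-in for a network output
t = 0.3

loss_v = velocity_loss_from_x(x_pred, x, eps, t)
# Same quantity written as an x-space error weighted by 1 / (1 - t)^2.
loss_x_weighted = np.mean((x_pred - x) ** 2) / (1.0 - t) ** 2
print(np.isclose(loss_v, loss_x_weighted))     # True
```

Because $x - z_t = (1-t)(x - \epsilon)$, the velocity error equals $(x_\theta - x)/(1-t)$, so the two losses agree exactly. The weight grows as $t$ approaches 1, so an implementation would presumably keep $t$ away from that endpoint; how the paper handles this is not stated here.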

2. CaloArt Backbone Architecture

CaloArt is a DiT-style architecture adapted for 3D calorimeter showers, incorporating several modern refinements:

3D Positional Encoding: Uses a combination of 3D Axial Rotary Positional Embeddings (RoPE) and Absolute Positional Embeddings (APE). RoPE phases are constructed separately along the longitudinal ($z$), radial ($r$), and angular ($\alpha$) axes to explicitly encode relative 3D patch positions.
Shared Conditioning Modulation: To improve parameter efficiency, the model uses a PixArt-style shared modulation strategy. Instead of separate modulation projections for every transformer block, a single global modulation tuple is computed from the conditioning signal (incident energy and timestep) and combined with layer-specific trainable embeddings. This reduces the parameter count by ~28% with negligible impact on performance.
Modern Components: The backbone utilizes SwiGLU feed-forward networks, RMSNorm, and query-key normalization, following the "LightningDiT" modernization recipe.
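
The idea behind axial 3D RoPE can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the feature dimension is split into three equal chunks (one per axis), each chunk uses standard RoPE frequencies with base 10000, and consecutive feature pairs are rotated. The function names are hypothetical.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
    ang = pos[..., None] * freqs                # shape (N, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axial_rope_3d(x, coords):
    """Apply RoPE per axis. x: (N, D) with D divisible by 6; coords: (N, 3)."""
    chunks = np.split(x, 3, axis=-1)            # one feature chunk per axis
    rotated = [rope_1d(c, coords[:, a].astype(float)) for a, c in enumerate(chunks)]
    return np.concatenate(rotated, axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 12))
k = rng.normal(size=(1, 12))
pos_q = np.array([[1, 2, 3]])   # (z, r, alpha) patch indices
pos_k = np.array([[4, 0, 5]])
shift = np.array([[7, 3, 2]])

score = np.sum(axial_rope_3d(q, pos_q) * axial_rope_3d(k, pos_k))
score_shifted = np.sum(axial_rope_3d(q, pos_q + shift) * axial_rope_3d(k, pos_k + shift))
print(np.isclose(score, score_shifted))   # True
```

The attention score is unchanged when both patches are shifted by the same offset, which is what "encoding relative 3D patch positions" means in practice.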

3. Training and Preprocessing

Preprocessing: Voxel energies below 15.15 keV are zeroed. Remaining values undergo a logarithmic transform followed by global standardization.
Outlier Mitigation: For Dataset 3, a redraw strategy is employed where samples with a deposited-to-incident energy ratio exceeding 2.7 are rejected and regenerated to prevent unphysically large energy deposits.
Datasets: The method is evaluated on CaloChallenge Dataset 2 (CCD2) (6,480 voxels) and Dataset 3 (CCD3) (40,500 voxels).
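
The preprocessing and redraw steps above can be sketched roughly as follows. Several details are assumptions made for illustration: energies are taken to be stored in MeV, the logarithm uses a small hypothetical offset `LOG_EPS` to handle zeros, the standardization statistics are passed in, and `fake_sampler` stands in for the trained generator.

```python
import numpy as np

THRESHOLD_MEV = 15.15e-3   # 15.15 keV, assuming voxel energies are in MeV
MAX_RATIO = 2.7            # deposited-to-incident energy cut from the text
LOG_EPS = 1e-6             # hypothetical offset so that log(0) is avoided

def preprocess(voxels, mean, std):
    """Zero sub-threshold voxels, take a log, then standardize globally."""
    cut = np.where(voxels < THRESHOLD_MEV, 0.0, voxels)
    return (np.log(cut + LOG_EPS) - mean) / std

def redraw(sample_fn, e_inc, max_tries=10):
    """Regenerate showers whose deposited/incident energy ratio is too large.

    sample_fn(e_inc) must return an array of shape (len(e_inc), n_voxels).
    Outliers that survive max_tries rounds are returned as they are.
    """
    showers = sample_fn(e_inc)
    for _ in range(max_tries):
        bad = showers.sum(axis=1) / e_inc > MAX_RATIO
        if not bad.any():
            break
        showers[bad] = sample_fn(e_inc[bad])
    return showers

# Stand-in generator: the first call produces one unphysical shower.
calls = {"n": 0}
def fake_sampler(e_inc):
    calls["n"] += 1
    out = np.full((len(e_inc), 4), 0.1) * e_inc[:, None]   # ratio 0.4
    if calls["n"] == 1:
        out[0] *= 10.0                                      # ratio 4.0, rejected
    return out

e_inc = np.array([1000.0, 2000.0])
showers = redraw(fake_sampler, e_inc)
ratios = showers.sum(axis=1) / e_inc
features = preprocess(np.array([0.001, 0.02, 5.0]), mean=0.0, std=1.0)
```

Only the rejected showers are regenerated, so the extra sampling cost stays proportional to the outlier rate.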

Key Results

Performance on CCD2 (Lower Resolution)

On CCD2, where the voxel count is lower and smaller patch sizes are computationally feasible:

v-prediction remains the superior choice over x-prediction.
CaloArt achieves the best Fréchet-Physics-Distance (FPD) among compared transformer models (14.11 vs. 16.0 for CaloDREAM++).
It achieves the strongest High-level and ResNet classifier AUCs (0.508 and 0.632, respectively). An AUC close to 0.5 means the classifier can barely distinguish generated showers from Geant4 references.
Generation Time: CaloArt takes 9.71 ms per shower on a single GPU, outperforming non-distilled baselines like CaloDiT-2 EDM and CaloDREAM++.

Performance on CCD3 (High Resolution)

On CCD3, the 40,500-voxel grid necessitates large patches to stay within compute budgets.

x-prediction is critical: Switching from v-prediction to x-prediction improves all reported metrics (FPD, High-level, Low-level, and ResNet AUCs). Under aggressive patch sizes, v-prediction fails to converge to usable samples, while x-prediction remains trainable.
Pareto Efficiency: CaloArt lies on the quality-generation-time Pareto frontier. It achieves an FPD of 42.2 with a generation time of 11.14 ms per shower.
Comparison: Compared to CaloDREAM++ (FPD 26.3, 96 ms per shower), CaloArt is roughly 8.6 times faster at the cost of a higher FPD (42.2). Compared to the convolutional L2LFlows (FPD 171.6, 16 ms per shower), it is both faster and more accurate.
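
The Pareto claim can be checked against the three CCD3 operating points quoted in this summary. The sketch below uses only those numbers (lower is better for both FPD and generation time); the paper's full comparison includes more models than are listed here, and the function name is illustrative.

```python
# (FPD, generation time in ms per shower), as quoted above.
models = {
    "CaloArt": (42.2, 11.14),
    "CaloDREAM++": (26.3, 96.0),
    "L2LFlows": (171.6, 16.0),
}

def pareto_front(points):
    """Return the names of models that no other model beats on both axes."""
    front = []
    for name, (fpd, ms) in points.items():
        dominated = any(
            f2 <= fpd and m2 <= ms and (f2 < fpd or m2 < ms)
            for other, (f2, m2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

front = pareto_front(models)
print(front)   # ['CaloArt', 'CaloDREAM++']
```

Among these three points, L2LFlows is dominated by CaloArt, while CaloArt and CaloDREAM++ represent different trade-offs between speed and fidelity.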

Computational Efficiency

The models are trained on a single NVIDIA A800 GPU.
The CCD3 model trains in 17.57 hours.
The approach avoids the cost of training a separate autoencoder tokenizer, directly generating raw voxels.

Significance and Claims

The paper claims that large-patch tokenization combined with x-prediction provides a compute-efficient route to high-granularity calorimeter shower synthesis.

Direct Generation: It demonstrates that high-fidelity generation is possible without a learned latent tokenizer, which is difficult to design for sparse, physics-constrained shower data.
Scalability: The results indicate that x-prediction is needed to train diffusion transformers on high-dimensional raw data (like CCD3) where large patches are required to manage token counts, since v-prediction failed to converge to usable samples in that regime.
Efficiency: By decoupling the prediction target from the loss space and utilizing modern transformer refinements (shared modulation, RoPE), CaloArt achieves state-of-the-art speed-accuracy trade-offs, reducing both training and inference costs for high-granularity simulations.

The authors position CaloArt as a "stronger default DiT backbone" for voxel-based calorimeter generation, offering a practical alternative to latent-space approaches for future high-luminosity collider experiments.

CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation