Cascade Pipeline for Leading-Order Matrix Element Evaluation on AMD Versal AI Engine Arrays: Plain-Language Explanation

Original authors: P. Leguina López, C. Vico Villalba, F. Hervás Álvarez, H. Gutiérrez Arance, S. Folgueras, L. Fiorini, A. Valero, J. Fernández Menéndez, F. Carrió, A. Oyanguren

Published 2026-05-05

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to predict the outcome of a massive, chaotic collision between two tiny particles (like protons) inside a giant particle accelerator. To do this, physicists use a complex mathematical recipe called a "matrix element." Calculating this recipe is like solving a giant, multi-step puzzle. The problem is that to get a reliable answer, they have to solve this same puzzle millions of times, each time with slightly different starting conditions.

Currently, doing this on standard computer processors (CPUs) is like trying to solve these puzzles one by one with a single person. It's accurate, but it's incredibly slow and uses a lot of energy, especially as the particle accelerator gets more powerful.

This paper presents a new way to solve these puzzles using a special type of computer chip called the AMD Versal AI Engine. Instead of having one person solve the whole puzzle, the authors built a factory assembly line right inside the chip.

Here is how their solution works, broken down into simple concepts:

1. The "Assembly Line" Problem

The mathematical recipe for this specific particle collision (two gluons turning into a top quark, an anti-top quark, and another gluon) is too big to fit into the memory of a single tiny processor on the chip. Think of it like trying to fit a 38-page instruction manual into a pocket that can only hold 16 pages.

The Solution: The authors split the manual into five chapters. They created a five-stage assembly line.

Stage 1: Reads the raw ingredients (the collision data) and prepares the first few steps.
Stages 2 and 3: Pass the work down the line, adding more steps to the calculation.
Stages 4 and 5: Finish the final calculations and spit out the answer.

2. The "Conveyor Belt" (Cascade Pipeline)

These five stages are connected by a super-fast, dedicated conveyor belt called a cascade interface.

Imagine a factory where workers don't stop to talk or wait for permission to pass a box to the next person. They just slide the box down a chute instantly.
In this chip, the "boxes" are chunks of data called tokens.
The authors designed a strict rulebook (a "deterministic contract") to ensure the workers never get stuck waiting for each other. Every worker knows exactly when to pass a box and when to receive one, so the line never jams.

3. The "Super-Factory" (80 Lines at Once)

The chip they used (the VCK190) is like a massive warehouse containing 400 tiny workers (called tiles).

Instead of building just one assembly line, they built 80 identical assembly lines side-by-side.
Each line has 5 workers, so $80 \text{ lines} \times 5 \text{ workers} = 400 \text{ workers}$.
They are all working at the same time, solving 80 different puzzles simultaneously.

4. The Results: Speed and Efficiency

The authors tested this "factory" against two other methods: a standard computer processor (CPU) and a high-end graphics card (GPU).

Speed: Their 80-line factory is 34 times faster than a single standard computer core.
- Note: A top-tier graphics card (GPU) is still faster overall (about 22 times faster than their chip), but the GPU is a much larger, more expensive machine.
Energy: This is where their method shines. Because the assembly line is so efficient and specialized, it uses very little electricity.
- To solve one puzzle, their chip uses 7.7 times less energy than a standard computer processor.
- Per puzzle, the giant GPU still uses less energy, but it draws far more power while it runs. The chip's method is a "sweet spot" for situations where you need speed but can't plug in a massive power-hungry machine.

5. Accuracy Check

They made sure their "assembly line" didn't make mistakes. They compared the answers from their chip against a "gold standard" double-precision calculation.

The results matched almost perfectly. The difference was so tiny (about 1 part in a million) that it is considered negligible for the physics calculations they are doing.

Summary

In short, the authors took a complex physics calculation that was too big for a single computer chip, chopped it into five manageable pieces, and built 80 parallel assembly lines to solve them all at once. This approach creates a "sweet spot" of high speed and low energy consumption, offering a powerful alternative for running the simulations needed to understand the universe at the Large Hadron Collider.

1. Problem Statement

Modern High Energy Physics (HEP) event generators, such as MadGraph5_aMC@NLO (MG5aMC), face a significant computational bottleneck in evaluating the matrix element ( $|M|^2$ ) for particle collisions. As the Large Hadron Collider (LHC) enters its High-Luminosity phase, the demand for these calculations grows non-linearly, while CPU scaling remains limited.

The Bottleneck: Matrix element evaluation accounts for 30–40% of total event generation time, particularly for multi-jet processes involving additional real emissions.
The Challenge: Existing GPU solutions (e.g., CUDACPP) offer high throughput but consume significant power. Field-Programmable Gate Array (FPGA) based platforms offer energy efficiency, but the AI Engine arrays on modern devices impose a 16 kB per-tile program memory (PM) constraint. A monolithic implementation of a complex process (like $gg \to t\bar{t}g$) exceeds this limit, so it cannot be mapped directly to a single tile.

2. Methodology

The authors propose a cascade pipeline architecture on the AMD Versal AI Engine (AIE) array (specifically the VCK190 platform) to overcome memory constraints and maximize parallelism.

A. Target Platform & Architecture

Hardware: AMD Versal XCVC1902 ACAP featuring 400 AI Engine tiles arranged in a $50 \times 8$ grid, clocked at 1.25 GHz.
Pipeline Decomposition: The $gg \to t\bar{t}g$ process (involving 16 Feynman diagrams and 10 distinct HELAS functions) is decomposed into a five-stage pipeline.
- Stage 1: Wavefunction generation (external spinors/vectors) and token initialization.
- Stages 2 & 3: Fermion-vector vertex evaluations (splitting 12 diagrams to balance memory).
- Stage 4: Triple-gluon vertex evaluations (including a deferred off-shell boson generator).
- Stage 5: Four-gluon contact terms and color matrix reduction.
Memory Management: To fit within the 16 kB limit, the authors employed program memory partitioning and deferred evaluation. For instance, the off-shell boson generator ( $FFV1P0\_3$ ) was moved from Stage 1 to Stage 4, reducing Stage 1's memory usage from 17.8 kB to 15.5 kB.
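The memory budget above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' tooling: it assumes the 16 kB limit means 16,384 bytes (which reproduces the 94.7% utilization reported for Stage 1), and it approximates "17.8 kB" as 17,800 bytes. The function names are hypothetical.

```python
PM_LIMIT_BYTES = 16 * 1024  # assumption: "16 kB" program memory means 16,384 bytes


def pm_utilization(code_bytes: int, limit: int = PM_LIMIT_BYTES) -> float:
    """Return program-memory utilization as a percentage of the per-tile limit."""
    return 100.0 * code_bytes / limit


def fits(code_bytes: int, limit: int = PM_LIMIT_BYTES) -> bool:
    """True if a kernel of this size fits in one tile's program memory."""
    return code_bytes <= limit


# Stage 1 before and after deferring the off-shell boson generator to Stage 4.
stage1_before = 17_800  # approximation of the quoted 17.8 kB
stage1_after = 15_514   # byte count reported for Stage 1

for label, size in [("before", stage1_before), ("after", stage1_after)]:
    print(f"Stage 1 {label}: {size} B, "
          f"{pm_utilization(size):.1f}% of PM, fits={fits(size)}")
```

Running the same check for every stage after each code change is one simple way to see which stage is closest to the limit and therefore where further deferral would help most.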

B. Inter-Tile Communication (Cascade Protocol)

Mechanism: Stages communicate via the 384-bit unidirectional cascade interface (60 GB/s bandwidth).
Token Protocol: A deterministic, deadlock-free protocol is used where stages exchange structured "tokens" containing wavefunctions and partial amplitudes.
- Extended Token (Stages 1–4): Carries 5 external wavefunctions, 3 precomputed propagators, and 6 color-flow amplitudes (18 beats/helicity).
- Reduced Token (Stages 4–5): Carries only 5 wavefunctions and amplitudes (12 beats/helicity) after local evaluation of the triple-gluon vertex.
Determinism: The system enforces a "cascade contract" with identical loop structures, unconditional writes, and statically matched token counts to ensure zero-overhead, flow-control-free operation.
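The "statically matched token counts" idea can be illustrated with a small consistency check. This is a sketch under stated assumptions, not the authors' implementation: it reads "Stages 1–4" as the links 1→2, 2→3 and 3→4 and "Stages 4–5" as the link 4→5, it assumes every event loops over all 32 helicity configurations, and it does not model how Stage 5 returns results. `StageContract`, `check_contract` and the other names are hypothetical. The bandwidth helper simply multiplies the quoted 384-bit width by the quoted 1.25 GHz clock.

```python
from dataclasses import dataclass

CASCADE_WIDTH_BITS = 384   # cascade width quoted in the text
AIE_CLOCK_HZ = 1.25e9      # AI Engine clock quoted in the text
N_HELICITIES = 32          # 2**5 helicity configurations (assumed loop count)


@dataclass(frozen=True)
class StageContract:
    name: str
    beats_in: int    # cascade beats read per helicity (0 for the first stage)
    beats_out: int   # cascade beats written per helicity (0: not modeled here)


PIPELINE = [
    StageContract("stage1", 0, 18),
    StageContract("stage2", 18, 18),
    StageContract("stage3", 18, 18),
    StageContract("stage4", 18, 12),   # switches to the reduced token
    StageContract("stage5", 12, 0),
]


def cascade_bandwidth_gb_s(width_bits: int = CASCADE_WIDTH_BITS,
                           clock_hz: float = AIE_CLOCK_HZ) -> float:
    """Peak bandwidth if one full-width beat moves every clock cycle."""
    return width_bits / 8 * clock_hz / 1e9


def check_contract(stages) -> None:
    """Raise if any producer/consumer pair disagrees on beats per helicity."""
    for up, down in zip(stages, stages[1:]):
        if up.beats_out != down.beats_in:
            raise ValueError(
                f"{up.name} writes {up.beats_out} beats but "
                f"{down.name} reads {down.beats_in}")


def beats_per_event(stages, n_helicities: int = N_HELICITIES):
    """Beats crossing each cascade link for one phase-space point."""
    return [s.beats_out * n_helicities for s in stages[:-1]]


check_contract(PIPELINE)
print(f"peak cascade bandwidth: {cascade_bandwidth_gb_s():.0f} GB/s")
print("beats per event on each link:", beats_per_event(PIPELINE))
```

Because a cascade write blocks when the consumer is not reading, a mismatch of even one beat between neighbours would stall the line. Checking the counts ahead of time, as above, is the software analogue of the static matching the contract relies on.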

C. Software Adaptations

HELAS Library Porting: The standard MG5aMC HELAS library (originally scalar double-precision C++) was ported to AI Engine vector intrinsics (single-precision float32).
Optimizations:
- Vectorization: Wavefunctions mapped to 8-wide SIMD vectors.
- Complex Division: Replaced Smith's method (2 divisions) with a single hardware reciprocal instruction.
- Helicity Caching: Precomputed 10 wavefunctions for 32 helicity configurations, selecting them via bit-indexed lookup to reduce evaluations by 16×.
- Color Reduction: Folded color normalization divisions into compile-time constants.
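Two of the optimizations listed above, helicity caching and single-reciprocal complex division, can be sketched in plain Python. This is an illustration under assumptions rather than the ported AI Engine code: the wavefunction routine is a hypothetical stand-in passed in by the caller, the bit-to-helicity convention is invented for the example, and an ordinary `1.0 / x` stands in for the hardware reciprocal instruction. Smith's method is written here in a two-division form.

```python
N_EXTERNAL = 5                 # g g -> t t~ g has five external particles
N_CONFIGS = 1 << N_EXTERNAL    # 2**5 = 32 helicity configurations


def build_cache(wavefunction):
    """Evaluate each external wavefunction once per helicity: 5 * 2 = 10 calls.

    cache[p][0] holds helicity -1 and cache[p][1] holds helicity +1.
    """
    return [[wavefunction(p, h) for h in (-1, +1)] for p in range(N_EXTERNAL)]


def select(cache, config):
    """Bit p of `config` (0..31) picks the helicity of particle p."""
    return [cache[p][(config >> p) & 1] for p in range(N_EXTERNAL)]


def smith_div(a, b, c, d):
    """(a + ib) / (c + id) by Smith's method, written with two divisions."""
    if abs(d) <= abs(c):
        r = d / c
        t = 1.0 / (c + d * r)
        return (a + b * r) * t, (b - a * r) * t
    r = c / d
    t = 1.0 / (c * r + d)
    return (a * r + b) * t, (b * r - a) * t


def recip_div(a, b, c, d):
    """Same quotient with one reciprocal: multiply by conj(c + id) / |c + id|**2."""
    inv = 1.0 / (c * c + d * d)   # stands in for the hardware reciprocal
    return (a * c + b * d) * inv, (b * c - a * d) * inv


# Count how often the (toy) wavefunction routine runs with caching enabled.
calls = []
cache = build_cache(lambda p, h: calls.append((p, h)) or (p, h))
naive = N_CONFIGS * N_EXTERNAL            # recomputing all five for every config
print(f"cached: {len(calls)} evaluations, naive: {naive}, "
      f"ratio: {naive // len(calls)}x")
print("config 0b00101 ->", select(cache, 0b00101))
print("smith:", smith_div(1.0, 2.0, 3.0, -4.0),
      "reciprocal:", recip_div(1.0, 2.0, 3.0, -4.0))
```

The ratio of 160 naive evaluations to 10 cached ones is one way to arrive at the 16× figure quoted above. On the division side, the single-reciprocal form trades away the overflow and underflow protection that motivates Smith's method, so it is only safe when the magnitude of the denominator is known to stay in a comfortable range.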

D. System Deployment

Scale: 80 independent pipelines were mapped onto the 400 available tiles (5 tiles per pipeline).
I/O: A packet-switched architecture in the Programmable Logic (PL) distributes phase-space points to the pipelines and collects results.
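To make the tile count concrete, the sketch below places 80 five-stage pipelines on a $50 \times 8$ grid. The row-wise layout (ten pipelines of five consecutive tiles per row) is purely an assumption for illustration. The placement actually used by the authors is not described here, and real cascade links follow fixed directions between neighbouring tiles that this sketch does not model. The function name `place` is hypothetical.

```python
COLS, ROWS = 50, 8   # AI Engine grid of the XCVC1902 as described in the text
STAGES = 5           # tiles per pipeline


def place(pipeline_id: int):
    """Return the (col, row) tile of each stage under an assumed row-wise layout."""
    per_row = COLS // STAGES                     # 10 pipelines fit in one row
    if not 0 <= pipeline_id < per_row * ROWS:
        raise ValueError(f"pipeline_id {pipeline_id} out of range")
    row, slot = divmod(pipeline_id, per_row)
    return [(slot * STAGES + stage, row) for stage in range(STAGES)]


tiles = [tile for pid in range(80) for tile in place(pid)]
print("pipelines:", 80, "tiles used:", len(set(tiles)), "of", COLS * ROWS)
print("pipeline 0 occupies:", place(0))
```

Any layout that gives each pipeline five cascade-adjacent tiles and leaves no tile unused would serve the same purpose. The point of the sketch is only that $80 \times 5$ covers the grid exactly, with no spare tiles.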

3. Key Contributions

Memory-Driven Pipeline Architecture: Introduced a novel 5-stage cascade pipeline that successfully partitions a complex multi-diagram matrix element computation across multiple AI Engine tiles, overcoming the 16 kB PM constraint.
Deterministic Cascade Contract: Developed a deadlock-free communication protocol using wavefunction tokens and static loop structures, eliminating the need for complex flow-control hardware.
Full HELAS Port: Successfully ported the complete HELAS amplitude library to AI Engine vector intrinsics, incorporating complex optimizations like binary-indexed helicity caching and reduced complex division.
Scalable Deployment: Showed that 80 pipelines can be mapped onto the VCK190, using all 400 of its AI Engine tiles, with array-level throughput projected rather than measured.

4. Results

Throughput: The projected throughput for the 80-pipeline array is $1.0 \times 10^6$ matrix element evaluations per second (ME/s).
- This represents a 34× speedup over a single-threaded CPU core (Intel i5-10600).
- While lower than an NVIDIA A100 GPU ($2.18 \times 10^7$ ME/s), the AI Engine solution draws roughly a third of the power (54.8 W versus 159 W). The GPU nevertheless uses less energy per evaluation, as the figures below show.
Energy Efficiency:
- AI Engine: 54.8 µJ/ME (at 54.8 W AIE-domain power).
- CPU: 422 µJ/ME.
- GPU: 7.3 µJ/ME (but at 159 W power).
- Improvement: The AI Engine offers a 7.7× improvement in energy efficiency compared to the CPU baseline.
Precision: Validated against the MG5aMC double-precision reference.
- Mean relative error: 1.43 ppm (parts per million).
- Maximum relative error: 168 ppm.
- This level of precision is deemed sufficient for Leading-Order (LO) calculations where physical uncertainties (scale variation, PDFs) dominate numerical errors.
Resource Utilization:
- Program Memory: Stage 1 is the bottleneck at 94.7% utilization (15,514 bytes).
- Programmable Logic: Modest usage (4.72% LUTs, 2.87% Registers), leaving room for additional logic.
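The throughput and energy figures above are linked by simple arithmetic: energy per evaluation is power divided by throughput. The sketch below cross-checks the quoted numbers. It uses only values stated in this section, and the helper name `energy_per_me_uj` is hypothetical. The CPU throughput it prints is merely what the rounded 34× figure implies, not a value quoted from the paper.

```python
def energy_per_me_uj(power_w: float, throughput_me_s: float) -> float:
    """Energy per matrix-element evaluation in microjoules."""
    return power_w / throughput_me_s * 1e6


AIE_THROUGHPUT = 1.0e6    # ME/s, projected for 80 pipelines
GPU_THROUGHPUT = 2.18e7   # ME/s, NVIDIA A100
CPU_ENERGY_UJ = 422.0     # µJ/ME, quoted directly

aie_uj = energy_per_me_uj(54.8, AIE_THROUGHPUT)
gpu_uj = energy_per_me_uj(159.0, GPU_THROUGHPUT)

print(f"AI Engine: {aie_uj:.1f} uJ/ME")
print(f"GPU:       {gpu_uj:.1f} uJ/ME")
print(f"CPU / AI Engine energy ratio: {CPU_ENERGY_UJ / aie_uj:.1f}x")
print(f"AI Engine / GPU energy ratio: {aie_uj / gpu_uj:.1f}x")
print(f"GPU / AI Engine throughput ratio: {GPU_THROUGHPUT / AIE_THROUGHPUT:.1f}x")
print(f"CPU throughput implied by 34x: {AIE_THROUGHPUT / 34:.0f} ME/s")
```

The numbers are mutually consistent: the AI Engine array sits between the CPU and the GPU on both throughput and energy per evaluation, while drawing the least power of the two accelerators.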

5. Significance and Future Work

Significance: This work indicates that AI Engine arrays are viable for high-throughput, low-power HEP event generation, particularly for power-constrained environments (e.g., online trigger systems or edge computing at the LHC) where GPU power envelopes are prohibitive. It establishes a systematic methodology for partitioning complex physics kernels across constrained memory tiles.
Limitations: The current implementation is limited to Leading-Order (LO) processes. The latency figures are based on cycle-approximate simulators rather than direct hardware timing of the full array.
Future Directions:
- Helicity Filtering: Pre-computing valid helicity masks to reduce the inner loop iterations, potentially doubling throughput.
- Higher Multiplicity: Extending the pipeline depth for more complex processes (e.g., $t\bar{t}ggg$ ).
- NLO Integration: Adapting the architecture for Next-to-Leading-Order calculations involving loop integrals.
- Hardware Evolution: Leveraging next-generation Versal devices with larger arrays or higher clock speeds.

In conclusion, the paper presents a lower-power alternative to GPU acceleration for specific HEP workloads, using the cascade interface of the AMD Versal AI Engine to work around the per-tile program memory limit that complex matrix element calculations would otherwise exceed.

Cascade Pipeline for Leading-Order Matrix Element Evaluation on AMD Versal AI Engine Arrays