TokaMark: A Comprehensive Benchmark for MAST Tokamak… — Plain-Language Explanation

Original authors: Cécile Rousseau, Samuel Jackson, Rodrigo H. Ordonez-Hurtado, Nicola C. Amorisco, Tobia Boschi, George K. Holt, Andrea Loreti, Eszter Székely, Alexander Whittle, Adriano Agnello, Stanislas Pamela, Ales

Published 2026-02-13

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine trying to predict the weather inside a star that is trapped inside a giant metal donut. That is essentially what scientists do when they study tokamaks, the machines designed to create clean, limitless fusion energy (the same power that fuels the sun).

The problem? The "weather" inside these machines is chaotic, moves incredibly fast, and we can't stick a thermometer inside it. We only have a handful of sensors on the outside, giving us a blurry, incomplete, and sometimes broken picture of what's happening.

This paper introduces TokaMark, a new "training ground" for Artificial Intelligence (AI) to help solve this puzzle. Here is the breakdown using simple analogies:

1. The Problem: The "Blindfolded Chef"

Think of a fusion reactor as a high-stakes kitchen where a chef (the AI) is trying to cook a perfect meal (stable plasma) without ever seeing the food.

The Sensors: The chef only has a few microphones listening to sizzling sounds, a few thermometers on the oven door, and a camera that sometimes glitches.
The Data Mess: These sensors speak different languages (some talk fast, some slow), they often go silent (missing data), and the information is messy.
The Old Way: Scientists used to try to solve this with complex math equations (physics models). It's like trying to calculate the exact trajectory of every single raindrop in a storm. It's accurate but takes so long to compute that by the time you finish the math, the storm has already changed.

2. The Solution: TokaMark (The "Gym" for AI)

Until now, AI researchers trying to learn how to control these reactors were like athletes training in isolation. One team had a dataset from a machine in the UK, another had data from France, and they all used different rules for scoring. They couldn't compare who was actually the best.

TokaMark is the first standardized "Olympic Gym" for fusion AI.

The Dataset: It gathers a massive library of real data from the MAST tokamak (a real fusion machine in the UK). It's like giving every AI chef the exact same set of recorded cooking sessions to study.
The Rules: It standardizes how the data is cleaned and how the AI is tested. Now, if Team A's AI predicts the weather better than Team B's, we can be confident the difference comes from the AI itself, not from different data or scoring rules.

3. The 14 Challenges (The "Events")

The benchmark isn't just one test; it's a multi-event competition of 14 different challenges, grouped into four categories:

Group 1: The Snapshot (Reconstruction)
- Analogy: Looking at a few blurry photos of a car and instantly drawing a perfect 3D model of the whole car.
- Goal: The AI looks at magnetic sensors and instantly figures out the exact shape and position of the invisible plasma ball inside.
Group 2: The Short-Term Forecast (Magnetics)
- Analogy: Watching a soccer ball being kicked and predicting exactly where it will be in the next 2 seconds.
- Goal: Predicting how the magnetic fields will wiggle and shift in the very near future based on current controls.
Group 3: The Slow Drift (Profile Dynamics)
- Analogy: Watching a cup of coffee cool down. It's slower, but you need to remember the history of the room's temperature to know how fast it will cool.
- Goal: Predicting how the heat and density inside the plasma change over time. This is harder because the plasma has "memory."
Group 4: The Disaster Warning (MHD Activity)
- Analogy: A seismologist trying to predict an earthquake before it happens by listening to tiny, subtle cracks in the earth.
- Goal: Spotting the tiny warning signs that the plasma is about to become unstable and crash (a "disruption"). This is the most dangerous and difficult task.

4. The "Baseline" (The Rookie Player)

To make sure the gym is fair, the authors provided a "Rookie Player" (a baseline AI model).

This is a standard, smart AI architecture that everyone can use as a starting point.
It's like giving every new athlete a standard pair of running shoes. If a new team wants to beat the record, they have to build a better shoe (a better AI), not just run on a different track.
The Result: The baseline AI did well at the "Snapshot" and "Short-Term Forecast" tasks, did noticeably worse on the "Slow Drift" tasks, and struggled most with the "Disaster Warning" tasks. This tells us that we know how to build a good AI for reconstructing shapes and short-term magnetic behavior, but we still need new kinds of AI to predict disasters.

Why Does This Matter?

Fusion energy is the "holy grail" of clean power. But to make it work commercially, we need to control the plasma in real time. If the plasma gets unstable, the machine shuts down or gets damaged.

TokaMark is the bridge. It allows AI researchers from all over the world to:

Speak the same language.
Compare their ideas fairly.
Rapidly develop AI that can act as a "co-pilot" for fusion reactors, keeping the star stable so we can finally turn on the lights of the future.

In short: TokaMark is the rulebook and the practice field that will help AI learn to tame the sun.

1. Problem Statement

The development of commercially viable fusion reactors, such as tokamaks, relies on accurate predictions of plasma dynamics. However, current approaches face significant hurdles:

Physics Complexity: Traditional first-principles models (solving coupled non-linear PDEs like the Grad-Shafranov equation) are computationally expensive, making them unsuitable for real-time control or large-scale parameter scans.
Data Heterogeneity: Experimental data from tokamaks is sparse, noisy, incomplete, and multi-modal. Sensors operate at vastly different sampling rates (from 0.2 kHz to 500 kHz), have varying spatial resolutions, and often suffer from missing data or asynchronous acquisition.
Lack of Standardization: Existing AI research in fusion is fragmented. Datasets are often proprietary, facility-specific, and inconsistently annotated. There is no unified benchmark to fairly compare AI models, hindering reproducibility and the development of generalist "Foundation Models" for plasma physics.
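
To make the multi-rate problem concrete, here is a minimal sketch of one common way to put a slow signal onto the time base of a faster one: linear interpolation, with samples outside the slow signal's coverage marked as missing. The function name `resample_linear`, the choice of linear interpolation, and the toy values are illustrative assumptions; this is not a description of how TokaMark itself aligns signals.

```python
from bisect import bisect_right
import math

def resample_linear(src_times, src_values, target_times):
    """Linearly interpolate (src_times, src_values) onto target_times.

    src_times must be sorted in increasing order. Target times that fall
    outside the source coverage are returned as NaN (treated as missing).
    """
    out = []
    for t in target_times:
        if t < src_times[0] or t > src_times[-1]:
            out.append(math.nan)          # no data available at this time
            continue
        i = bisect_right(src_times, t) - 1
        if i == len(src_times) - 1:       # t coincides with the last sample
            out.append(src_values[-1])
            continue
        t_lo, t_hi = src_times[i], src_times[i + 1]
        w = (t - t_lo) / (t_hi - t_lo)    # interpolation weight in [0, 1)
        out.append((1 - w) * src_values[i] + w * src_values[i + 1])
    return out

# Hypothetical example, times in integer milliseconds:
# a 0.2 kHz signal (one sample every 5 ms) resampled onto a 1 kHz grid.
slow_t = [0, 5, 10]
slow_v = [0.0, 1.0, 3.0]
fast_t = list(range(13))                  # 0 ms .. 12 ms
aligned = resample_linear(slow_t, slow_v, fast_t)
```

The last two entries of `aligned` are NaN because the slow signal ends at 10 ms, which is the kind of gap a downstream model or loss function then has to mask.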

2. Methodology: The TokaMark Benchmark

To address these gaps, the authors introduce TokaMark, the first large-scale, open benchmark for evaluating AI models on real experimental data from MAST (the Mega Ampere Spherical Tokamak).

A. Data Source and Taxonomy

Dataset: Utilizes FAIR-MAST, an open dataset containing 11,573 shots from the last five MAST campaigns.
Signal Selection: The benchmark curates 39 signals across heterogeneous modalities, categorized by:
- Origin: Diagnostics (measurements), Actuators (controls), and Derived signals (reconstructed quantities).
- Modality: Time series (1D), Profiles (2D), and Videos (3D).
- Frequency: Ranging from 0.2 kHz to 500 kHz.
Preprocessing: The data is standardized with consistent metadata, units, and formatting. A strict train/validation/test split (80/10/10) is performed at the shot level to prevent data leakage.
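
Splitting at the shot level means every sample from a given shot lands in exactly one partition, so a model is never tested on a shot it partly saw in training. The sketch below shows the idea with a seeded random shuffle; the helper name `split_shots`, the shuffle-based assignment, and the shot numbers are assumptions for illustration, and the benchmark's own split procedure may differ.

```python
import random

def split_shots(shot_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Partition shot IDs into disjoint train/validation/test lists."""
    ids = sorted(set(shot_ids))             # deduplicate, fix a stable order
    random.Random(seed).shuffle(ids)        # reproducible shuffle
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],      # remainder goes to the test set
    }

# Hypothetical shot numbers; real FAIR-MAST shot IDs will differ.
splits = split_shots(range(1000, 1100))
```

Windows or samples would then be drawn only from the shots listed in the relevant partition, never split across partitions.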

B. Task Definition

The benchmark defines 14 downstream tasks organized into four groups, designed to test specific capabilities required for fusion modeling:

Group 1: Equilibrium Reconstruction: Inferring plasma shape, boundaries, and flux maps from instantaneous magnetic measurements (Reconstruction tasks).
Group 2: Magnetics Dynamics: Short-term forecasting of magnetic signals and equilibrium evolution in response to actuator commands (Reconstructive Forecasting).
Group 3: Profile Dynamics: Modeling the temporal evolution of kinetic profiles (density/temperature) and confinement transitions, often dealing with sparse or delayed inputs (Autoregressive and Reconstructive Forecasting).
Group 4: MHD Activity: Long-horizon forecasting of rare, safety-critical events like thermal quenches, vertical displacement events, and locked modes. These require integrating long-range temporal context and multi-modal data.

Task Structure: Tasks are defined using input and output windows anchored at a reference time $t_0$. They vary in temporal dependency, distinguishing between Markovian tasks (short history sufficient) and Non-Markovian tasks (requiring long history to infer latent states).
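
The windowing idea can be sketched as follows, assuming a forecasting-style task where the input window ends at $t_0$ and the output window starts at $t_0$. The function name `extract_windows`, the half-open interval convention, and the toy signal are assumptions; the benchmark's exact window definitions may differ from task to task.

```python
def extract_windows(times, values, t0, input_len, output_len):
    """Return (input_window, output_window) anchored at reference time t0.

    Input covers [t0 - input_len, t0); output covers [t0, t0 + output_len).
    Times and window lengths are expressed in the same unit.
    """
    inputs = [v for t, v in zip(times, values) if t0 - input_len <= t < t0]
    outputs = [v for t, v in zip(times, values) if t0 <= t < t0 + output_len]
    return inputs, outputs

# Hypothetical signal sampled once per millisecond.
times = list(range(20))
values = [10 * t for t in times]

# Markovian-style task: a short history is assumed to be enough.
short_hist, target = extract_windows(times, values, t0=10, input_len=2, output_len=3)

# Non-Markovian-style task: same target, but a much longer history is supplied.
long_hist, _ = extract_windows(times, values, t0=10, input_len=8, output_len=3)
```

Only the length of the input window changes between the two calls, which is the distinction the Markovian and Non-Markovian labels capture.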

C. Evaluation Protocol

A hierarchical evaluation metric is introduced to assess performance at three levels:

Signal Level: Normalized Root-Mean-Square Error (NRMSE) per signal, normalized by the signal's empirical standard deviation to allow cross-signal comparison.
Task Level: Average NRMSE across all output signals for a specific task.
Group Level: Average NRMSE across all tasks within a group.
This hierarchy allows researchers to diagnose performance on specific physical regimes while assessing overall scientific utility.
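
The three levels can be sketched in a few lines. This is a minimal illustration, assuming the normalizing standard deviation is computed on the same reference values being scored and that missing samples have already been removed; the function names and toy data are hypothetical and are not the benchmark package's API.

```python
import math
from statistics import mean, pstdev

def nrmse(y_true, y_pred):
    """Signal level: RMSE divided by the standard deviation of the reference."""
    rmse = math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))
    return rmse / pstdev(y_true)

def task_nrmse(signals):
    """Task level: average NRMSE over output signals, {name: (y_true, y_pred)}."""
    return mean([nrmse(t, p) for t, p in signals.values()])

def group_nrmse(tasks):
    """Group level: average task NRMSE over the tasks in a group."""
    return mean([task_nrmse(s) for s in tasks.values()])

# Hypothetical toy data, not taken from the benchmark.
truth = [0.0, 2.0, 4.0, 6.0]
perfect = list(truth)
mean_only = [mean(truth)] * len(truth)    # always predicts the signal mean

group = {
    "task_a": {"sig_1": (truth, perfect), "sig_2": (truth, mean_only)},
    "task_b": {"sig_3": (truth, perfect)},
}
signal_scores = (nrmse(truth, perfect), nrmse(truth, mean_only))
task_a_score = task_nrmse(group["task_a"])
group_score = group_nrmse(group)
```

Under these assumptions a predictor that always outputs the signal mean scores an NRMSE of 1.0, which is a handy reference point when reading scores near or above 1.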

D. Baseline Model

The authors provide a multi-branch convolutional encoder–decoder architecture as a baseline:

Architecture: Independent encoders for each input modality (1D, 2D, or 3D convolutions depending on signal type) feed into a shared latent fusion backbone (MLP). Separate decoders reconstruct each target output.
Training: Trained with the Adam optimizer, MSE loss, and early stopping. No physics-informed priors are used, so the model serves as a "physics-agnostic" baseline.
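
The architecture itself would normally be written in a deep-learning framework, so the sketch below covers only the early-stopping rule, in plain Python. It assumes a patience-based criterion on validation loss; the patience value, the `min_delta` threshold, and the loss values are hypothetical and are not the settings used by the authors.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs pass without the
    validation loss improving on the best value by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch    # new best checkpoint
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch               # patience exhausted
    return best_epoch, len(val_losses) - 1         # ran to the last epoch

# Hypothetical validation-loss curve for illustration only.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=3)
```

With these values the loss last improves at epoch 2 and training halts at epoch 5, so the later drop to 0.6 is never reached; that trade-off is what the patience setting controls.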

3. Key Contributions

First Open Benchmark: TokaMark is the first standardized, open-source benchmark specifically for MAST tokamak data, enabling fair comparison across the AI-for-Science community.
Comprehensive Task Suite: Defines 14 diverse tasks covering the full spectrum of fusion challenges, from fast magnetic dynamics to slow transport evolution and rare disruption events.
Data Engineering: Resolves schema inconsistencies in FAIR-MAST, standardizes metadata/units, and provides robust tools for handling missing data, multi-rate alignment, and batching.
Tooling and Reproducibility: Releases a Python package integrated with PyTorch for data loading, processing, and evaluation, along with a strong baseline model and training scripts.

4. Experimental Results

The baseline model was evaluated across all 14 tasks (Table 3 in the paper):

High Performance (Groups 1 & 2): The model achieved low Group NRMSE scores (0.163 and 0.126, respectively) for Equilibrium Reconstruction and Magnetics Dynamics. This indicates that for fast, predominantly linear dynamics, generic deep learning architectures can effectively learn the mapping from diagnostics to equilibrium.
Moderate Performance (Group 3): Profile Dynamics tasks showed higher errors (Group NRMSE ~0.34), reflecting the difficulty of modeling non-linear transport physics and integrating sparse, multi-rate data.
Low Performance (Group 4): MHD Activity tasks yielded the highest errors (Group NRMSE ~0.48), with Task 4-5 (Locked Modes) exceeding an NRMSE of 1.0. This suggests that predicting rare, non-linear instability precursors is extremely difficult for current generic architectures without specific physics constraints or more sophisticated temporal modeling.
Interpretation: The results highlight that while AI can act as a good surrogate for fast magnetic responses, modeling complex transport and rare disruptions remains a significant challenge, defining the "difficulty floor" for future research.

5. Significance

Accelerating Fusion Research: By providing a unified framework, TokaMark removes the barrier of data fragmentation, allowing researchers to focus on algorithmic innovation rather than data wrangling.
Bridging Communities: It fosters collaboration between the fusion physics community and the broader AI/ML community, encouraging the application of state-of-the-art models (e.g., Transformers, Diffusion models) to real-world scientific problems.
Path to Generalist Models: The benchmark is designed to support the development of "Foundation Models" for plasma physics—models that learn transferable latent representations from massive datasets, potentially reducing the need for hand-crafted pipelines for every new diagnostic or task.
Safety and Control: Success in Group 4 tasks (MHD/Disruptions) is critical for the safe operation of future reactors like ITER and DEMO, where early detection of instabilities is vital for machine protection.

In conclusion, TokaMark establishes a rigorous, reproducible standard for data-driven plasma modeling, marking a pivotal step toward realizing the potential of AI in achieving sustainable fusion energy.

TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models