Original authors: D. Sorokin, M. Stokolesov, A. Granovskiy, I. Prokofyev, E. Adishchev, M. Nurgaliev, E. Khayrutdinov, G. Subbotin, R. Clark, D. Orlov

Published 2026-05-18

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine a tokamak (a machine designed to create fusion energy) as a giant, invisible, super-hot balloon made of plasma. To keep this balloon from touching the walls and melting the machine, scientists must constantly reshape it, squeezing it into specific forms like a peanut, a circle, or a bean.

The paper describes a new "smart pilot" (an AI agent) that controls this balloon. Here is how it works, explained through simple analogies.

1. The Problem: The Old Way vs. The New Way

The Old Way (The Two-Step Dance):
Traditionally, controlling the plasma was like a two-step dance. First, a team of experts (a computer program) had to look at all the sensors and figure out exactly what shape the balloon was in. Second, a separate controller would take that shape and tell the magnets how to move.

The Flaw: If one of the sensors broke or gave a bad reading, the first step could produce a wrong picture of the shape, and everything after it suffered. Also, the second step was usually tuned around one typical shape, so it struggled when the balloon needed large or rapid shape changes.

The New Way (The Intuitive Athlete):
The authors created a Reinforcement Learning (RL) agent. Think of this agent as a gymnast who has practiced thousands of times. Instead of stopping to calculate the shape first, the gymnast feels the wind and the tension and instantly knows how to move.

The Breakthrough: This AI learns to go directly from "sensor readings" to "magnet commands" without needing to explicitly calculate the shape first. It learns to handle the physics directly.

2. The Superpower: Ignoring Broken Sensors

In the real world, sensors break. Maybe a wire gets cut, or a probe gets dirty.

The Analogy: Imagine playing a video game where your controller loses a few buttons randomly every time you start a new level. Most players would quit.
The AI's Trick: The researchers trained this AI by randomly "blinding" its sensors during practice: at the start of each practice round, every sensor had a 30% chance of going silent. They didn't tell the AI which sensors were broken; they just made them go silent.
The Result: The AI learned to play the game well even when it couldn't see roughly a third of the screen. It learned to rely on the remaining sensors to figure out the shape. This means if a sensor fails during a real experiment, the AI doesn't need a backup plan; it just keeps working with what it has.

3. The Training: The "Shape Gym"

To teach the AI, they didn't just show it one shape. They created a "gym" with 120 different, complex plasma shapes (like different balloon configurations).

The Drill: Every quarter of a second, the AI was told to switch to a completely new shape. It had to learn how to morph from a "peanut" to a "bean" to a "circle" instantly.
The Goal: The AI learned to handle transitions between these shapes in general, not just a pre-planned route. This is called "zero-shot" generalization, meaning it can follow new, unseen sequences of target shapes without needing extra practice on them.

4. The "Cheat Sheet" (Asymmetric Training)

Here is a clever trick the researchers used to speed up learning:

The Actor (The Player): During training, the AI only sees what the real machine sees (the sensors).
The Critic (The Coach): The "Coach" AI, however, has a "cheat sheet." It can see the perfect truth of what the plasma is doing (the exact shape, the exact speed), which the real machine can't see.
How it helps: The Coach tells the Player, "You're doing okay, but you're actually 2 centimeters off." This helps the Player learn much faster. Once training is done, the Player is deployed without the Coach, but it has already learned the lessons.

5. The "Side Hustle" (The Auxiliary Head)

The AI has a small extra task: while it is controlling the magnets, it also tries to guess the shape of the plasma on the side.

Why? This acts like a "training wheel." It forces the AI to keep a clear mental picture of the shape, which makes the whole system more stable. It also helps scientists understand which sensors the AI is paying attention to, acting like a window into the AI's brain.

6. The Real-World Test

The researchers didn't just test this in a computer simulation. They took the trained AI and put it on the actual DIII-D tokamak (a real fusion machine in California).

The Result: The AI successfully controlled the real plasma, moving it from one shape to another and keeping it stable, even when some sensors were effectively "ignored" or masked. A classical controller tuned for that specific operating point still showed lower steady-state error in simulation, but the AI was more robust to the missing sensors present in the experiment.

Summary

This paper presents a self-driving car for fusion energy.

It learns by practicing with broken sensors, so it keeps working when a sensor fails.
It learns to change shapes instantly, not just hold a steady position.
It was trained in a high-fidelity simulator but successfully drove the real car (the DIII-D machine) without needing to be re-tuned.

The ultimate goal is to make fusion power plants safer and more reliable by having a controller that can handle the messy, unpredictable reality of the real world.

Technical Summary: Dynamic Plasma Shape Control with Arbitrary Sensor Subsets

Problem Statement

Precise plasma shape control is critical for the safe and efficient operation of tokamaks, influencing energy confinement, heat load distribution, and stability. Classical control systems, such as those deployed on DIII-D and JET, typically employ a two-stage pipeline: first, a real-time equilibrium reconstruction code (e.g., RTEFIT) estimates the plasma boundary from magnetic diagnostics; second, a linear multi-input multi-output (MIMO) controller issues coil commands to track target shapes.

This traditional approach faces three significant limitations:

Fragility to Sensor Failures: Reconstruction algorithms are designed for a full sensor set; missing diagnostics degrade reconstruction accuracy unpredictably, compromising downstream control.
Limited Dynamic Range: Linear controllers are often tuned around a nominal equilibrium, struggling with large, dynamic shape variations or transitions between regimes.
Lack of Adaptability: Handling new failure patterns typically requires manual weight updates between shots, with no capacity for mid-shot adaptation.

While recent Reinforcement Learning (RL) approaches have demonstrated end-to-end control, they generally assume a fixed, fully operational diagnostic set and target static setpoints or pre-planned sequences, failing to address arbitrary dynamic targets or partial sensor availability.

Methodology

The authors present a single Reinforcement Learning (RL) agent designed to address dynamic shape tracking, arbitrary sensor subsets, and partial observability simultaneously.

Environment and Training Distribution

The agent is trained in NSFsim, a high-fidelity tokamak simulator configured for the DIII-D device that models the full power system dynamics, including chopper circuits and coil current constraints.

Goal Space: Instead of uniform random sampling of the 11-dimensional shape goal space (which risks physically unreachable configurations), the authors curated a dataset of 120 experimental Lower Single Null (LSN) shapes drawn from over 329,000 DIII-D equilibria (2014–2020). A greedy diversity criterion ensured these shapes span the full operational envelope.
Dynamic Transitions: During training, the target shape is resampled randomly from this dataset every 0.25 seconds, exposing the agent to diverse transitions across the full shape envelope.
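
The two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the exact greedy diversity criterion is not specified here, so farthest-point selection stands in for it as an assumption, the candidate shapes are synthetic random vectors, and the function names and the 2-second episode length are hypothetical.

```python
import numpy as np

def greedy_diverse_subset(shapes, k, seed_index=0):
    """Farthest-point selection (an assumed stand-in for the paper's criterion):
    repeatedly add the shape that is farthest from everything chosen so far."""
    chosen = [seed_index]
    # Distance from every candidate to its nearest already-chosen shape.
    nearest = np.linalg.norm(shapes - shapes[seed_index], axis=1)
    while len(chosen) < k:
        idx = int(np.argmax(nearest))
        chosen.append(idx)
        nearest = np.minimum(nearest, np.linalg.norm(shapes - shapes[idx], axis=1))
    return np.array(chosen)

def target_schedule(n_shapes, episode_seconds, hold_seconds, rng):
    """Draw one random target index per hold interval (0.25 s in the text)."""
    n_segments = int(np.ceil(episode_seconds / hold_seconds))
    return rng.integers(0, n_shapes, size=n_segments)

rng = np.random.default_rng(0)
candidates = rng.normal(size=(5000, 11))   # synthetic stand-in for 11-D shape goals
library_idx = greedy_diverse_subset(candidates, k=120)
library = candidates[library_idx]

# Hypothetical 2-second episode: the target changes 8 times.
schedule = target_schedule(len(library), episode_seconds=2.0, hold_seconds=0.25, rng=rng)
targets = library[schedule]                # one 11-D goal per 0.25 s segment
```

Sampling targets independently per segment means consecutive goals can be far apart in shape space, which is what exposes the agent to large transitions rather than a smooth pre-planned path.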

Diagnostic Dropout and Robustness

To achieve robustness against sensor failures without explicit fault detection or mode switching, the authors employ a diagnostic dropout strategy:

At the start of each training episode, a binary mask is sampled by independently zeroing each of the 114 magnetic diagnostic channels (71 probes + 43 loops) with a probability of $p=0.3$.
The agent receives no explicit indicator of which sensors are missing; it must infer the absence of signals from the pattern of mean-substituted inputs.
This yields a single policy capable of operating gracefully under arbitrary sensor subsets.
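
A minimal sketch of this dropout scheme, assuming dropped channels are replaced by a per-channel mean (if the inputs are normalized to zero mean, that is the same as zeroing them). The function names and the synthetic readings are hypothetical; only the channel counts and $p=0.3$ come from the text.

```python
import numpy as np

N_PROBES, N_LOOPS = 71, 43
N_MAGNETIC = N_PROBES + N_LOOPS  # 114 magnetic channels, as stated above

def sample_dropout_mask(rng, p_drop=0.3, n_channels=N_MAGNETIC):
    """Sample once per episode. True = channel kept, False = channel dropped."""
    return rng.random(n_channels) >= p_drop

def apply_mask(magnetics, mask, channel_means):
    """Replace dropped channels with their mean value.
    No mask flag is appended, so the policy never sees which channels are missing."""
    return np.where(mask, magnetics, channel_means)

rng = np.random.default_rng(0)
means = rng.normal(size=N_MAGNETIC)          # synthetic per-channel means
raw = means + rng.normal(size=N_MAGNETIC)    # synthetic sensor readings
mask = sample_dropout_mask(rng)              # held fixed for the whole episode
observed = apply_mask(raw, mask, means)
```

Because the mask is drawn independently per channel, the number of dropped sensors varies from episode to episode around 30% of the set, so the policy sees many different subsets rather than one fixed failure pattern.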

Architecture: Asymmetric Actor-Critic with Auxiliary Loss

The agent utilizes an asymmetric actor-critic architecture to handle partial observability:

Actor: Receives a 146-dimensional observation vector comprising magnetic probes, flux loops, coil currents, plasma current ($I_p$), and the 11-dimensional shape goal. Magnetic channels may be masked.
Critic (Privileged): Receives the actor's observation augmented with "privileged" information available only in simulation: signed differences between current and target pivot points ($\Delta p$) and X-point positions ($\Delta x$), along with time derivatives of all inputs. This aids value estimation under partial observability.
Algorithm: The agent is trained using Truncated Quantile Critics (TQC), a distributional off-policy RL algorithm that reduces overestimation bias.
Auxiliary Shape Reconstruction Head: A linear prediction head attached to the actor's penultimate layer predicts the pivot-point error ($\Delta p$) from raw diagnostics. This loss ($L_{aux}$) serves two purposes:
1. Training Stabilization: It anchors the actor's internal representation to a physically interpretable geometric quantity, reducing early episode terminations.
2. Interpretability: It enables gradient-based sensor importance analysis and functions as a standalone shape reconstruction module.
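
To make the "reduces overestimation bias" point about TQC concrete, here is a minimal sketch of its truncation step for a single transition: the quantile estimates of all critics are pooled, sorted, and the largest ones are discarded before forming the target. The numbers, the function name, and the hyperparameter values are illustrative assumptions, and the entropy term that TQC inherits from SAC is omitted for brevity.

```python
import numpy as np

def truncated_target_atoms(next_quantiles, reward, gamma, drop_per_critic):
    """next_quantiles: shape (n_critics, n_quantiles), the critics' quantile
    estimates of the return at the next state for one transition."""
    n_critics = next_quantiles.shape[0]
    pooled = np.sort(next_quantiles.reshape(-1))   # pool all atoms, ascending
    n_drop = drop_per_critic * n_critics           # discard the largest atoms
    kept = pooled[: pooled.size - n_drop]
    return reward + gamma * kept                   # target atoms for the critics

# Two toy critics with four quantiles each; one critic has an optimistic outlier.
next_quantiles = np.array([[1.0, 2.0, 3.0, 4.0],
                           [2.0, 3.0, 4.0, 10.0]])
atoms = truncated_target_atoms(next_quantiles, reward=1.0, gamma=0.5, drop_per_critic=1)
```

In this toy case the two largest pooled atoms (10.0 and one of the 4.0 values) are dropped, so the outlier never reaches the target.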

Reward Function

The reward combines shape tracking quality and X-point stability through a softmax-weighted average that acts as a soft minimum. It penalizes deviations of eight pivot points on the Last Closed Flux Surface (LCFS) and of the X-point position; because the worse-performing component receives most of the weight, the agent cannot sacrifice one objective to optimize the other.
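
One way such a soft minimum can be built is shown below. The exact functional form, temperature, and error scaling used in the paper are not given here, so everything in this sketch (the exponential score, the 2 cm scale, the temperature of 0.1, and the function names) is an assumption chosen for illustration.

```python
import numpy as np

def score_from_error(error_m, scale_m=0.02):
    """Map a distance error in metres to a score in (0, 1]; 1 means perfect."""
    return float(np.exp(-error_m / scale_m))

def soft_min(components, temperature=0.1):
    """Softmax-weighted average with weights favouring the LOWEST component.
    As temperature -> 0 this approaches min(components);
    as temperature -> infinity it approaches the plain mean."""
    c = np.asarray(components, dtype=float)
    logits = -c / temperature
    logits -= logits.max()          # shift for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    return float(np.sum(weights * c))

balanced = soft_min([0.9, 0.9])     # both objectives tracked well
lopsided = soft_min([0.9, 0.2])     # shape good, X-point poor
```

With a plain mean, the lopsided case would still score 0.55; with the soft minimum it stays close to the weaker component, so improving the already-good objective cannot compensate for neglecting the other.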

Key Results

Simulation Performance (NSFsim)

Dynamic Tracking: On a held-out static configuration, the agent achieved a mean shape error ($\bar{d}_{shape}$) of 2.01 cm. It successfully tracked dynamic trajectories to extreme configurations (e.g., maximum elongation, rightmost X-point), though errors increased at the boundaries of the coil current envelope due to voltage limits.
Diagnostic Robustness: An agent trained with $p=0.3$ dropout achieved a mean $\bar{d}_{shape}$ of 4.1 cm on a fixed sensor mask corresponding to actual DIII-D failures. This is only 0.7 cm worse than an "oracle" policy trained specifically on that fixed mask, demonstrating that the single policy generalizes to arbitrary subsets without prior knowledge of the failure pattern.
Ablation Studies:
- Removing the asymmetric critic (privileged info) caused the largest performance drop ($\bar{d}_{shape}$ increased from 4.0 to 4.9 cm).
- Removing the auxiliary loss did not significantly change the mean reward but increased the standard deviation of episode length from 0.7 to 21.0 steps, confirming its role as a training stabilizer.
- Replacing TQC with SAC resulted in lower rewards and significantly higher variance in X-point control, with occasional total loss of control on difficult shapes.

Physical Deployment (DIII-D)

The policy was deployed on the DIII-D tokamak for two dynamic maneuvers:

X-point Radial Sweep: Successfully tracked a target X-point moving from 1.36 m to 1.31 m.
Plasma Centroid Shift: Successfully shifted the plasma centroid between two matched discharges ($R_c$ from 1.685 m to 1.660 m).

In physical experiments, the RL agent maintained the plasma in the Lower Single Null regime throughout. While the classical isoflux controller showed lower steady-state error in the GSevolve simulator (due to specific tuning for that operating point), the RL agent demonstrated superior robustness to the specific sensor dropout conditions present in the experiment. A "sim-to-real" gap was observed in X-point tracking error for one discharge, attributed to systematic offsets in raw magnetic readings that EFIT absorbs but which shift the RL policy's inputs.

Sensor Importance

Gradient-based analysis of the auxiliary head revealed that the policy relies most heavily on magnetic diagnostics near the 8 target pivot points and the inner limiter wall. The importance rankings were stable across different dropout training rates, suggesting the structure reflects the task geometry rather than training noise.
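
The idea behind gradient-based sensor importance can be illustrated on a toy network: differentiate the auxiliary output with respect to each input channel and average the magnitudes. The sketch below uses a single tanh hidden layer followed by a linear head with random weights; the layer sizes, weights, and function names are hypothetical and do not reflect the trained policy.

```python
import numpy as np

def aux_head_jacobian(x, W1, b1, W2):
    """Jacobian d(output)/d(input) for output = W2 @ tanh(W1 @ x + b1)."""
    h = np.tanh(W1 @ x + b1)
    # Chain rule: W2 @ diag(1 - h^2) @ W1
    return W2 @ (W1 * (1.0 - h**2)[:, None])

def sensor_importance(inputs, W1, b1, W2):
    """Mean absolute gradient per input channel, over samples and outputs."""
    grads = np.stack([np.abs(aux_head_jacobian(x, W1, b1, W2)) for x in inputs])
    return grads.mean(axis=(0, 1))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 114, 16, 8           # toy sizes, not the real network
W1 = 0.1 * rng.normal(size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_out, n_hidden))
W1[:, 3] = 0.0                               # channel 3 is ignored by this toy net

inputs = rng.normal(size=(32, n_in))         # synthetic diagnostic samples
importance = sensor_importance(inputs, W1, b1, W2)
ranking = np.argsort(importance)[::-1]       # most important channels first
```

A channel the network ignores, such as channel 3 here, gets an importance of exactly zero, while channels with large gradients are the ones whose readings most change the predicted shape error.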

Significance and Claims

The paper claims to present the first end-to-end control method that simultaneously addresses:

Training Distribution Coverage: Using a curated dataset of experimental shapes to avoid the curse of dimensionality while covering the operational envelope.
Zero-Shot Generalization: The ability to track unseen dynamic shape trajectories without trajectory-specific fine-tuning.
Diagnostic Robustness: A single policy that operates under arbitrary subsets of magnetic diagnostics without backup controllers or explicit fault detection logic.

The authors emphasize that the auxiliary shape reconstruction head not only stabilizes training but also provides a mechanism for interpretability, allowing for the analysis of which sensors drive control decisions. The successful transfer from the NSFsim simulator to the independent GSevolve simulator and finally to the physical DIII-D device validates the approach's potential for real-world tokamak operation under variable diagnostic conditions.

Dynamic Plasma Shape Control with Arbitrary Sensor Subsets