A probabilistic framework for crystal structure… — Plain-Language Explanation

Original authors: Hyuna Kwon, Babak Sadigh, Sebastien Hamel, Vincenzo Lordi, John Klepeis, Fei Zhou

Published 2026-05-12



Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to identify a specific pattern in a crowded room, but everyone is dancing wildly, shaking hands, and bumping into each other. The room is so chaotic that it's hard to tell who is wearing a red shirt and who is wearing a blue one. This is what scientists face when they look at computer simulations of atoms. The atoms are constantly jiggling due to heat (thermal noise), and sometimes the arrangement itself has missing or extra atoms (defects).

This paper introduces a new "smart assistant" for scientists that does three things at once: it calms the chaos, identifies the pattern, and measures how close the atoms are to that pattern.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Noisy" Crystal

In the atomic world, materials like metals or ice are made of atoms arranged in specific, repeating patterns called crystal prototypes (like a perfect grid of oranges). However, in real life or computer simulations, these atoms are never perfectly still. They vibrate, they get pushed around, and sometimes they are missing.

Old tools were like trying to sort a messy pile of LEGOs by looking at just one piece at a time. If a piece was slightly bent or missing, the tool would get confused or give up.
Old tools also treated "cleaning up the mess" and "identifying the pattern" as two separate jobs. First, you'd try to fix the atoms, and then you'd try to guess what they were.

2. The Solution: A Single "Super-Model"

The authors built a single AI model that acts like a universal translator and a noise-canceling headphone combined.

The "Map" (Log-Probability): Imagine the model creates a 3D map of the entire room. On this map, the "perfect" crystal patterns are high, sunny hills, and the messy, chaotic areas are deep valleys.
The "Denoising" (Walking Uphill): When the model sees a messy atom, it looks at the map and says, "You are in a valley; walk uphill toward the nearest hill." It gently pushes the atoms back toward their perfect positions. This is called denoising.
The "Identification" (Reading the Sign): As the atoms move up the hill, the model also checks the sign at the top of that specific hill. Is it the "Ice" hill? The "Titanium" hill? It instantly knows which pattern the atom belongs to.
The "Confidence Meter" (Order Parameters): The model doesn't just say "Yes" or "No." It gives a score. If an atom is right at the peak, it's 100% sure. If an atom is halfway up the hill (maybe near a defect or a boundary between two materials), the score is lower. This tells the scientist, "I'm pretty sure this is ice, but it's a bit wobbly here."

3. How It Was Trained

The team taught this model using a massive library of perfect crystal structures (from a database called the Materials Project). They didn't just show it the perfect versions; they intentionally shook them, stretched them, and added "static" (noise) to the data.

They taught the model: "When you see a structure that looks almost like this perfect ice pattern, but is messy, push it back to the perfect ice pattern and tell me it's ice."

4. What It Can Do (The Results)

The paper tests this model on some very difficult scenarios:

Melting Ice: It successfully identified different types of ice even when they were vibrating so hard they were almost melting.
Missing Atoms: When they removed atoms from a metal (creating a hole, known as a vacancy), the model didn't get confused. It still identified the crystal pattern of the surrounding metal correctly, but it also gave a low confidence score right around the hole, effectively highlighting the defect.
Changing Shapes: It watched atoms slowly transform from one shape to another (like a square turning into a circle). Instead of saying "It's a square" then suddenly "It's a circle," it smoothly tracked the transition, showing the atoms gradually shifting their identity.
Shock Waves: They tested it on titanium being hit by a massive shockwave (like an explosion). The metal was being squashed and twisted violently. The model could still see the different phases forming and show the scientists where the new phases were appearing, even in the chaos.

5. Why It Matters

The key innovation is unification. Before this, scientists needed one tool to clean the data, another to label it, and a third to measure the disorder. This model does all three in one go.

It's like having a single app that cleans your photo, identifies the person in the photo, and tells you how blurry the photo is, all at the same time. The authors emphasize that while other tools might be slightly better at just one specific task (like pure classification), this tool is the first to combine cleaning, identifying, and measuring uncertainty into one smooth, continuous process.

In short: This paper presents a new way to look at messy atomic data that doesn't just guess what the atoms are, but also gently fixes the mess and tells you how sure it is about its answer.

Technical Summary: A Probabilistic Framework for Crystal Structure Denoising, Phase Classification, and Order Parameters

Problem Statement
Atomistic simulations generate vast amounts of structural data, yet extracting robust phase labels and continuous order parameters (OPs) from noisy configurations remains a significant challenge. Existing tools often suffer from three main limitations:

Specialization: They are frequently restricted to a limited set of well-studied prototypes (e.g., BCC, FCC, HCP) and rely on hand-crafted heuristics (e.g., Common Neighbor Analysis, Bond-Orientational Order Parameters, Polyhedral Template Matching) that degrade under strong thermal disorder, defects, or phase coexistence.
Fragmentation: The processes of thermal-noise removal (denoising), phase classification, and OP construction are typically treated as separate, sequential steps. This separation can lead to the loss of subtle structural information and lacks a unified statistical foundation.
Lack of Continuity: Most approaches focus on discrete classification without providing continuous confidence measures to capture ambiguity near phase boundaries or in highly disordered regions.

Methodology
The authors propose a unified probabilistic framework based on a learned log-probability ( $\log \hat{P}_\theta$ ) landscape over atomic configurations. The core methodology involves:

Model Architecture: The framework utilizes a Graph Neural Network (GNN) based on the MACE (Many-Body Atomic Cluster Expansion) architecture. The model is modified to output per-atom, per-prototype logits ( $l_{ac}$ ), where $a$ indexes atoms and $c$ indexes candidate crystal phases (AFLOW prototypes).
Global Log-Density: A global scalar log-density is constructed by aggregating these logits:
$\log \hat{P}_\theta(r) = \sum_a \log \left( \sum_c \exp(l_{ac}(r)) \right)$
Conservative Denoising: The gradient of this scalar field, $s_\theta(r) = \nabla_r \log \hat{P}_\theta(r)$ , defines a conservative vector field. This field drives an iterative denoising process that refines noisy atomic positions toward high-probability regions of the prototype manifold.
Classification and Order Parameters: The per-atom logits $l_{ac}$ serve dual purposes:
1. Phase Labels: Obtained via $\arg \max_c l_{ac}$ .
2. Continuous OPs: The values of $l_{ac}$ act as continuous, phase-resolved order parameters measuring the similarity of a local environment to a specific prototype.
Training Strategy: The model is trained on a curated subset of Materials Project structures mapped to AFLOW prototypes. The training objective combines two losses:
1. Score-Matching Loss: Trains the model to predict the restoring field (gradient) for structures perturbed by Gaussian noise, analogous to force matching in Machine Learning Interatomic Potentials (MLIPs).
2. Classification Loss: A cross-entropy loss on the logits to ensure the model correctly identifies the prototype label of ideal structures.
Data Augmentation: Training data includes synthetic elastic deformations (isotropic scaling and symmetric strain) and Gaussian positional noise to simulate thermal fluctuations.
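The log-density, score field, denoising loop, and argmax labels described above can be illustrated with a minimal sketch. This is not the paper's MACE-based model: it assumes a hypothetical one-dimensional toy in which each prototype is a single ideal site and each logit is the negative squared distance to that site, scaled by an assumed width `SIGMA`. The names `PROTOTYPE_SITES`, `SIGMA`, `step`, and `n_steps` are all illustrative choices, and the gradient is written out analytically rather than obtained by automatic differentiation.

```python
import math

# Hypothetical toy setup (NOT the paper's GNN): 1D "atoms", two prototypes,
# each represented by one ideal site. SIGMA is an assumed length scale.
PROTOTYPE_SITES = {"A": 0.0, "B": 1.0}
SIGMA = 0.2


def logits(x):
    """Per-prototype logits l_c(x) for one atom: negative squared distance."""
    return {c: -((x - site) ** 2) / (2 * SIGMA**2)
            for c, site in PROTOTYPE_SITES.items()}


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_density(positions):
    """log P(r) = sum_a log sum_c exp(l_ac(r))."""
    return sum(logsumexp(list(logits(x).values())) for x in positions)


def score(positions):
    """Analytic gradient of log_density with respect to each position."""
    grads = []
    for x in positions:
        l = logits(x)
        lse = logsumexp(list(l.values()))
        # d/dx log sum_c exp(l_c) = sum_c softmax_c * dl_c/dx
        g = sum(math.exp(l[c] - lse) * (-(x - site) / SIGMA**2)
                for c, site in PROTOTYPE_SITES.items())
        grads.append(g)
    return grads


def denoise(positions, step=0.01, n_steps=200):
    """Plain gradient ascent on log_density (step size is an assumption)."""
    r = list(positions)
    for _ in range(n_steps):
        r = [x + step * g for x, g in zip(r, score(r))]
    return r


def classify(positions):
    """Phase label per atom: the prototype with the largest logit."""
    labels = []
    for x in positions:
        l = logits(x)
        labels.append(max(l, key=l.get))
    return labels


noisy = [0.10, -0.07, 0.88, 1.12]
clean = denoise(noisy)
labels = classify(noisy)
```

Because the score here is the gradient of a single scalar, the update direction is conservative by construction, which is the property the summary attributes to the framework. In the actual model the logits come from a learned network, so the gradient would come from automatic differentiation rather than a closed form.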

Key Contributions

Unified Framework: The work unifies denoising, phase classification, and order parameter extraction into a single differentiable scalar model, eliminating the need for separate pipelines.
Conservative Score Field: Unlike previous non-conservative score-based denoisers, this model derives denoising directions from a conservative gradient field, ensuring thermodynamic consistency in the refinement process.
Continuous and Interpretable OPs: The per-phase logits provide physically interpretable order parameters. The authors demonstrate that these logits behave approximately as the negative squared distance to ideal prototype structures, offering a smooth, continuous measure of structural similarity.
Ambiguity Quantification: The framework naturally provides measures of confidence (logit margins and softmax entropy), allowing for the identification of ambiguous regions near defects, interfaces, or phase boundaries.
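The two confidence measures named above, the logit margin and the softmax entropy, can be computed from one atom's logits in a few lines. The sketch below uses the standard definitions (gap between the two largest logits, and Shannon entropy of the softmax distribution); whether the paper normalizes or scales them differently is not stated in this summary, and the example logit values are invented purely for illustration.

```python
import math


def softmax(logit_row):
    """Softmax over one atom's per-prototype logits (numerically stable)."""
    m = max(logit_row)
    exps = [math.exp(v - m) for v in logit_row]
    total = sum(exps)
    return [e / total for e in exps]


def logit_margin(logit_row):
    """Gap between the largest and second-largest logit."""
    top, second = sorted(logit_row, reverse=True)[:2]
    return top - second


def softmax_entropy(logit_row):
    """Shannon entropy (in nats) of the softmax over prototypes."""
    return -sum(p * math.log(p) for p in softmax(logit_row) if p > 0.0)


# Invented example logits: one atom deep in a bulk phase, one near a defect.
bulk_atom = [8.0, -3.0, -5.0]
near_defect_atom = [1.2, 0.9, -2.0]

bulk_margin = logit_margin(bulk_atom)
defect_margin = logit_margin(near_defect_atom)
bulk_entropy = softmax_entropy(bulk_atom)
defect_entropy = softmax_entropy(near_defect_atom)
```

A large margin with low entropy indicates a confident assignment, while a small margin with high entropy flags an atom whose environment resembles more than one prototype.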

Results
The model was evaluated across a wide range of interpolation and extrapolation regimes:

Large-Scale Classification: On a dataset of ~15,000 structures (elemental, binary, and ternary compounds), the model achieved >99% classification accuracy on clean inputs and stayed above 99% on structures perturbed with Gaussian noise up to 0.15 Å, while keeping denoising RMSE below 0.002 Å.
Multi-Phase Ice Polymorphs: The model successfully distinguished seven distinct ice polymorphs (Ic, Ih, II, III, VI, VII, sI) under thermal perturbations, with perfect classification accuracy and sharp logit distributions upon denoising.
Continuous Transformation Paths: Along the Bain (BCC-FCC) and Burgers (HCP-BCC) paths, the logit-based OPs exhibited smooth, continuous transitions, correctly tracking the evolution of structural similarity without discrete jumps.
Thermal Disorder and Defects: In high-temperature MD snapshots (DC3 database) and systems with point defects (vacancies), the model outperformed traditional template-matching methods (PTM, CNA). While PTM often failed to assign labels to atoms near defects or at high temperatures, the probabilistic model maintained robust global phase identification while providing continuous, defect-sensitive OPs that highlighted local disorder.
Complex Systems: The framework successfully handled binary polymorphs (e.g., AgO, ZnO), water-ice coexistence interfaces (distinguishing liquid from solid without training on liquid), and shock-compressed Titanium. In the Ti case, it identified coexisting HCP and $\omega$ phases and tracked the nucleation and growth of the $\omega$ phase via a continuous order parameter ( $l_\omega - l_{HCP}$ ), whereas template-based methods often misclassified distorted $\omega$ regions.
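The difference order parameter used in the titanium case, $l_\omega - l_{HCP}$, is simple to evaluate once per-atom logits are available. The sketch below is a hedged illustration only: the logit values are made up, and counting atoms with a positive difference as $\omega$-like (threshold 0) is an assumption of this example, not a rule taken from the paper.

```python
def phase_difference_op(l_omega, l_hcp):
    """Per-atom order parameter l_omega - l_HCP; positive means omega-like."""
    if len(l_omega) != len(l_hcp):
        raise ValueError("logit lists must have the same length")
    return [lo - lh for lo, lh in zip(l_omega, l_hcp)]


def omega_fraction(op_values, threshold=0.0):
    """Fraction of atoms whose order parameter exceeds the threshold."""
    return sum(1 for v in op_values if v > threshold) / len(op_values)


# Made-up logits for four atoms in three successive snapshots.
frames = [
    {"l_omega": [-4.0, -3.5, -4.2, -3.8], "l_hcp": [2.0, 1.8, 2.1, 1.9]},
    {"l_omega": [1.5, -0.2, -3.9, -3.6], "l_hcp": [-1.0, 0.1, 2.0, 1.7]},
    {"l_omega": [2.2, 1.9, 0.4, -0.5], "l_hcp": [-2.0, -1.5, 0.1, 0.3]},
]
fractions = [
    omega_fraction(phase_difference_op(f["l_omega"], f["l_hcp"]))
    for f in frames
]
```

Tracking the per-atom values themselves, rather than only the thresholded fraction, is what keeps the description continuous: atoms partway through the transformation show intermediate values instead of flipping labels abruptly.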

Significance and Claims
The authors emphasize that the primary contribution is not necessarily achieving state-of-the-art classification accuracy on purely discriminative tasks (where specialized classifiers might excel), but rather the unification of multiple structure analysis tasks within a single probabilistic framework.

Key claims regarding the significance of this work include:

Robustness: The model remains effective in extrapolation regimes involving strong thermal disorder, point defects, and non-equilibrium shock compression, where traditional geometric methods fail.
Physical Interpretability: The logit-based OPs offer a direct physical interpretation related to the distance from ideal prototypes, providing a smooth, continuous metric for structural evolution.
Extensibility: The framework is designed to be extensible to new prototypes and chemistries simply by including them in the training set, without requiring system-specific retraining or hand-crafted heuristics.
Integrated Analysis: By providing denoising, classification, and continuous OPs simultaneously, the tool facilitates the analysis of complex microstructures, interfaces, and phase transformations in a way that is difficult with disjointed pipelines.

The paper concludes that this approach offers a practical, integrated tool for analyzing noisy atomistic simulations, bridging the gap between discrete structural classification and continuous thermodynamic characterization.
