Symmetry-restricted energy landscapes as a benchmark for machine learned interatomic potentials

This paper introduces a symmetry-restricted benchmark that systematically evaluates the fidelity of universal machine-learned interatomic potentials by comparing their predicted two-dimensional potential energy surface slices against DFT calculations to reveal artifacts and assess their ability to capture critical topological features like local minima and saddle points.

Original authors: Abhijith S Parackal, Rickard Armiento, Florian Trybel

Published 2026-02-03

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ✨ This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to navigate a vast, foggy mountain range. Your goal is to find the deepest valley (the most stable state) and understand the shape of the hills and ridges around it. In the world of materials science, this "mountain range" is called a Potential Energy Surface (PES). It's a map that tells scientists how much energy a specific arrangement of atoms has.

For a long time, the only reliable way to draw this map was using Density Functional Theory (DFT). Think of DFT as a high-resolution satellite camera: it captures the terrain in fine detail and serves as the reference in this study. However, it's incredibly slow and expensive to use, like trying to survey a whole continent by walking every inch of it with a tape measure.

To speed things up, scientists started using Machine Learned Interatomic Potentials (MLIPs). These are like AI-powered GPS apps. They have been trained on millions of "satellite photos" (data from DFT) so they can predict the terrain instantly. Recently, "Universal" versions of these GPS apps (like MACE, CHGNet, and ORB) have been released. They claim to work for any material, not just the ones they were specifically trained on.

The Problem:
While these AI GPS apps are fast and usually accurate, nobody really knew if they were drawing the entire map correctly. They might get the main valley right, but what about the tricky ridges, the hidden caves, or the steep cliffs far away from the center? If the AI hallucinates a fake valley or misses a cliff, it could lead scientists to believe a material is stable when it's actually going to collapse.

The Solution: The "Symmetry Slice" Test
The authors of this paper created a new way to test these AI models. Instead of trying to map the whole mountain range at once (the real landscape has one dimension for every way the atoms can move, far too many to visualize), they decided to take 2D slices of the terrain.

Here is how they did it, using a simple analogy:
Imagine a crystal structure is like a complex Lego castle. The castle has rules (symmetry) that say certain bricks must move together. If you move one red brick, three other red bricks must move in the exact same way.

Pick two "knobs": The researchers picked two specific ways the Lego bricks could wiggle (called "Wyckoff degrees of freedom").
Turn the knobs: They turned these two knobs through every possible combination, creating a grid of different castle shapes.
Draw the map: For every shape, they asked the AI: "How much energy does this cost?" and compared it to the "Super-Resolution Camera" (DFT).
The Result: They got a colorful contour map (like a topographic map) showing hills and valleys.
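The "bricks must move together" rule can be made concrete with a small sketch. It assumes a hypothetical set of four symmetry operations (those of point group 222) rather than any structure from the paper; `OPS` and `orbit` are illustrative names. Turning one knob on a single generating atom moves all four symmetry-linked atoms by the same amount, in directions fixed by the symmetry.

```python
import numpy as np

# Assumption: the four operations of point group 222, written as sign matrices.
# They are chosen for illustration and are not taken from the paper's structures.
OPS = [np.diag(s) for s in ([1, 1, 1], [-1, -1, 1], [-1, 1, -1], [1, -1, -1])]

def orbit(xyz):
    """All symmetry-equivalent fractional positions of one generating atom."""
    xyz = np.asarray(xyz, dtype=float)
    return np.array([(op @ xyz) % 1.0 for op in OPS])

before = orbit([0.10, 0.20, 0.30])
after = orbit([0.12, 0.20, 0.30])           # turn the x "knob" by 0.02

# Displacement of each atom, wrapped back into [-0.5, 0.5)
shift = (after - before + 0.5) % 1.0 - 0.5
print(shift)
```

Each row of `shift` has an x component of magnitude 0.02 and no y or z component, so a single number controls four atoms at once. This is why two such knobs are enough to span a meaningful 2D slice.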

What They Found:
By looking at these 2D maps, they discovered some surprising things about the AI models:

The "Smooth" Lie: Near the bottom of the valley (where atoms are happy and stable), most of the AI models did well. They closely matched the DFT camera, so a check made only near the valley floor would not reveal the problems listed next.
The "Ghost" Valleys: In some cases, the AI models invented fake valleys. For example, in a material called AlTiN3, one version of the AI (MACE_MPA-0) showed a deep, attractive valley where the real physics said there was nothing but a flat plain. If a scientist used this AI to design a new material, they might get "stuck" in this fake valley and think they found a new stable structure, when in reality, it doesn't exist.
The "Cliff" Problem: When atoms were pushed too close together (like crashing two Lego bricks into each other), some AI models started behaving strangely. Instead of saying "This costs an enormous amount of energy," some models said, "Oh, this is actually very low energy!" This is like a GPS telling you to drive straight through a mountain because it thinks the mountain is a tunnel. This happens because these "crash" scenarios are missing from the data the AI was trained on.
The "Narrow" View: One model (ORB v2) flattened the whole map. It showed a much smaller difference between the highest hill and the lowest valley than the other models did, missing the dramatic ups and downs.

The Takeaway
This paper doesn't just say "AI is good" or "AI is bad." It provides a visual benchmark. It's like giving a driving instructor a way to see exactly where a student driver is making mistakes, rather than just looking at the final score.

The authors show that while these universal AI models are powerful tools for discovering new materials, they can still have "blind spots" or "hallucinations" in complex or extreme situations. By using these 2D symmetry slices, scientists can now visually inspect these models, spot the fake valleys, and fix them before relying on them for important discoveries. It's a quality control check for the future of materials science.

Problem Statement
Machine-learned interatomic potentials (MLIPs), particularly universal pre-trained models (uMLIPs) based on architectures like MACE, CHGNet, and ORB, have become standard tools for large-scale materials discovery and molecular dynamics due to their DFT-level accuracy and computational efficiency. However, while these models perform well on standard validation metrics (e.g., root mean square errors on energies and forces), their fidelity in reproducing the detailed topology of potential energy surfaces (PES) remains poorly understood. Specifically, there is uncertainty regarding their ability to accurately capture high-energy local minima, saddle points, and gradients far from equilibrium. Previous studies have noted issues such as the "softening" of energy surfaces away from minima and the prediction of unphysical structures during geometry optimization, often attributed to biased sampling of near-equilibrium configurations in training datasets. Current benchmarking methods often rely on opaque scalar error values that fail to reveal specific topological artifacts or structural failures in the energy landscape.
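The limitation of scalar metrics can be illustrated with a toy example. The sketch below uses synthetic 1D curves, not data from the paper: `reference` stands in for a DFT curve, `model_a` differs from it by a uniform offset, and `model_b` adds a narrow spurious well. The offset is chosen so both models have the same RMSE, yet only one of them changes the number of local minima. All names are hypothetical.

```python
import numpy as np

def count_local_minima(y):
    """Count strict interior local minima of a sampled 1D curve."""
    interior = y[1:-1]
    return int(np.sum((interior < y[:-2]) & (interior < y[2:])))

x = np.linspace(-2.0, 2.0, 401)
reference = x**2                      # synthetic stand-in for a DFT energy curve

# Model B: correct almost everywhere, plus a narrow spurious well near x = 1.2
model_b = reference - 1.5 * np.exp(-((x - 1.2) / 0.1) ** 2)
rmse_b = np.sqrt(np.mean((model_b - reference) ** 2))

# Model A: a uniform offset sized to give the same RMSE as model B
model_a = reference + rmse_b
rmse_a = np.sqrt(np.mean((model_a - reference) ** 2))

print(f"RMSE  A: {rmse_a:.4f}  B: {rmse_b:.4f}")
print("minima reference/A/B:",
      count_local_minima(reference),
      count_local_minima(model_a),
      count_local_minima(model_b))
```

The two models are indistinguishable by RMSE, but a geometry optimization started near x = 1.2 would behave very differently on each of them.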

Methodology
The authors propose a systematic workflow to visualize and evaluate the PES of uMLIPs by constructing symmetry-restricted two-dimensional slices of the energy landscape (s2DPES). The methodology involves:

Symmetry Constraints: Utilizing Wyckoff positions to define symmetry-equivalent atomic sites within a crystal structure. This reduces the dimensionality of the configuration space by varying only the degrees of freedom (DOF) allowed by the crystal's space group.
Grid Generation: Creating a 2D meshgrid by varying two selected Wyckoff DOFs (e.g., x and z coordinates of specific atoms) within a defined range and step size.
Distance Filtering: Implementing a cost function based on the sum of Wigner-Seitz radii to penalize and exclude unphysical atomic configurations where interatomic distances fall below a minimum threshold, ensuring that artifacts arising from atomic overlap are identified.
Energy Calculation: Computing the energy for each grid point using various uMLIPs (including MACE variants, ORB, CHGNet, and SevenNet) and comparing them against Density Functional Theory (DFT) reference calculations.
Visualization: Generating contour plots of the resulting 2D energy landscapes to allow for direct visual comparison of local minima, saddle points, and overall surface curvature between different models and DFT.
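The grid-generation and distance-filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cubic cell, the three-atom arrangement, the threshold `R_MIN` (a stand-in for the summed radii), and the Lennard-Jones `toy_energy` (a placeholder for a uMLIP or DFT call) are all hypothetical.

```python
import numpy as np

A = 4.0       # cubic cell edge in angstrom (hypothetical)
R_MIN = 1.2   # minimum allowed distance; stand-in for the summed radii

def positions(x, z):
    """Fractional coordinates: one fixed atom plus a symmetry-linked pair."""
    return np.array([[0.0, 0.0, 0.0],
                     [x, 0.0, z],
                     [-x, 0.0, -z]])

def pair_distances(frac):
    """Minimum-image distances between all atom pairs in the cubic cell."""
    i, j = np.triu_indices(len(frac), k=1)
    delta = frac[i] - frac[j]
    delta -= np.round(delta)          # wrap differences into [-0.5, 0.5]
    return A * np.linalg.norm(delta, axis=1)

def toy_energy(frac):
    """Lennard-Jones sum; only a placeholder for a real energy calculator."""
    r = pair_distances(frac)
    sigma, eps = 1.8, 1.0
    return float(np.sum(4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)))

# 2D meshgrid over the two free parameters x and z
xs = np.linspace(0.05, 0.45, 41)
zs = np.linspace(0.05, 0.45, 41)
energy = np.full((len(xs), len(zs)), np.nan)   # NaN marks filtered points

for a, x in enumerate(xs):
    for b, z in enumerate(zs):
        frac = positions(x, z)
        if pair_distances(frac).min() >= R_MIN:   # distance filter
            energy[a, b] = toy_energy(frac)

print(f"{np.isnan(energy).sum()} of {energy.size} grid points filtered out")
```

The resulting `energy` array can be passed to a contour-plotting routine. Computing the same grid with a second calculator gives a second array with the same shape, ready for side-by-side comparison.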

Key Contributions

Benchmarking Framework: The paper introduces a reproducible workflow for generating s2DPES, enabling a direct, visual comparison of MLIP predictions against DFT references. This approach moves beyond scalar error metrics to assess the physical accuracy of the PES topology.
Systematic Analysis: The method allows for the isolation of specific structural features (local minima, saddle points) and the identification of model-specific artifacts, such as spurious energy drops in regions of atomic overlap or the prediction of non-existent local minima.
Model Comparison: The study evaluates a diverse set of state-of-the-art uMLIPs, including multiple generations of MACE models trained on different datasets (Materials Project, Alexandria, OMat24, MATPES), as well as ORB, CHGNet, and SevenNet.

Results
The application of the s2DPES workflow to three distinct crystal systems ($W_2N_3$, $AlTiN_3$, and $Cu_2O_8S_4$) revealed several critical findings:

General Performance: Most models accurately capture the local energy minimum and the general curvature of the PES near equilibrium for structures outside their training data.
Artifacts in Overlap Regions: Models lacking explicit repulsion terms (e.g., SevenNet0, CHGNet, and to a lesser extent ORB v2) exhibited unphysical energy drops in regions of significant atomic overlap, a consequence of these configurations being absent from training datasets.
Model-Specific Artifacts:
- MACE_MPA-0: In the $AlTiN_3$ system, this model predicted a distinct local minimum in a region where DFT and other MACE models indicated no stable configuration. This artifact caused geometry optimizations to become trapped in a spurious basin, highlighting the risks of relying on a single model for structure search.
- MACE_MATPES-PBE: In the $Cu_2O_8S_4$ system, this model converged to a different local minimum compared to other models and DFT, even after lifting symmetry constraints.
Progression of Quality: Newer models, such as MACE_OMAT-0 (trained on larger datasets like OMat24), demonstrated energy landscapes that more closely matched DFT references, suggesting that improvements in training data and architectural refinements enhance PES fidelity.
Energy Range Discrepancies: ORB v2 predicted a significantly narrower energy range compared to other models, indicating potential limitations in capturing the full energetic span of the landscape.
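Two of the failure modes listed above, a spurious local minimum and a compressed energy range, can be detected automatically once the 2D grids exist. The sketch below runs on synthetic surfaces, not on the paper's data: `reference` is a simple bowl, `model_spurious` adds an artificial well, and `model_flat` is a scaled-down copy. The helpers `local_minima` and `energy_span` are hypothetical names.

```python
import numpy as np

def local_minima(grid):
    """Indices of grid points strictly lower than all eight neighbours."""
    padded = np.pad(grid, 1, constant_values=np.inf)
    core = padded[1:-1, 1:-1]
    is_min = np.ones(core.shape, dtype=bool)
    rows, cols = padded.shape
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            shifted = padded[1 + di: rows - 1 + di, 1 + dj: cols - 1 + dj]
            is_min &= core < shifted
    return [(int(i), int(j)) for i, j in np.argwhere(is_min)]

def energy_span(grid):
    """Difference between the highest and lowest energy, ignoring NaNs."""
    return float(np.nanmax(grid) - np.nanmin(grid))

axis = np.linspace(-1.0, 1.0, 41)
u, v = np.meshgrid(axis, axis, indexing="ij")

reference = u**2 + v**2                                   # single basin
well = 0.8 * np.exp(-((u - 0.6) ** 2 + (v - 0.6) ** 2) / 0.02)
model_spurious = reference - well                          # extra basin
model_flat = 0.3 * reference                               # compressed range

print("reference minima:", local_minima(reference))
print("spurious-model minima:", local_minima(model_spurious))
print("span reference / flat:", energy_span(reference), energy_span(model_flat))
```

A mismatch in the number or location of minima flags a candidate spurious basin, and a much smaller span flags a flattened landscape. Both checks complement visual inspection of the contour plots and do not replace it.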

Significance
The paper argues that visualizing symmetry-constrained energy landscapes is a crucial tool for diagnosing model failures and understanding the limitations of uMLIPs, particularly in regions far from equilibrium. The authors claim that this approach provides insights that scalar error metrics cannot, such as identifying spurious minima that could lead to incorrect structure predictions or phase stability assessments. The work underscores the necessity of rigorous benchmarking beyond simple error measures, especially as models become more sophisticated. By offering a framework to track the effects of fine-tuning, transfer learning, and architectural changes, the study aims to support the development of more physically faithful interatomic potentials for reliable materials discovery.
