🔬 materials science

Performance of universal machine learning potentials in global optimization

This paper systematically benchmarks the latest generation of universal machine learning potentials in unconstrained global optimization tasks, revealing a wide performance spectrum from near ab initio accuracy to non-predictive results while demonstrating that several models can successfully capture subtle electronic structure features to identify complex crystal ground states.

Original authors: Edan T. Marcial, Laxman Chaudhary, Olesya Gorbunova, Aleksey N. Kolmogorov

Published 2026-03-02

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to find the absolute best way to stack a massive pile of LEGO bricks to build a stable castle. In the world of materials science, these "bricks" are atoms, and the "castle" is a new crystal structure.

For decades, scientists have used a super-precise, but incredibly slow, method to figure out the best stacking order. It's like trying to solve a Rubik's cube by moving one tiny piece at a time and checking the physics of every single move. This is called Density Functional Theory (DFT). It's accurate, but it takes so much computing power that only a tiny fraction of the possible arrangements can ever be checked.

Enter Machine Learning Potentials (MLPs). Think of these as "smart shortcuts." Instead of calculating the physics from scratch every time, the computer learns from a massive library of previous calculations. It becomes a "crystal intuition" engine that can guess the energy of a structure almost instantly.

Recently, scientists developed Universal Machine Learning Potentials (uMLPs). These are like "all-in-one" apps. Instead of training a specific app just for "Iron Bricks" or "Carbon Bricks," these models are trained on everything in the periodic table. The hope is that you can just download the app and start building any kind of crystal, anywhere, without needing to customize it first.

The Big Test: The "Unconstrained" Challenge

The authors of this paper asked a tough question: Do these "all-in-one" apps actually work when you let them run wild?

Usually, these apps are tested on structures we already know (like a pre-made LEGO set). But in real discovery, scientists want to find new structures that no one has ever seen before. This is called Global Optimization. It's like telling the computer, "Here are some atoms, build me the most stable castle you can, and don't give me any hints about what it should look like."

The researchers took nine of the latest, most popular "all-in-one" apps (models like M3GNet, MACE, SevenNet, etc.) and let them loose on 12 different chemical systems. They wanted to see if these models could find the true "ground state" (the most stable, lowest-energy structure) or if they would get lost in the weeds.

The Results: A Tale of Two Models

The results were a mix of "Wow!" and "Uh oh."

1. The Star Performers:
Some models, particularly eSEN and SevenNet, were like expert master builders. They could navigate the complex landscape of atoms, find the hidden valleys where the most stable structures hide, and distinguish between very similar-looking designs. They were so good that they could even spot subtle electronic tricks that nature uses to stabilize certain metals.

2. The Strugglers:
Other models, like the older M3GNet, were a bit like a confused tourist. They often got stuck in "fake" valleys—structures that looked stable to the model but were actually nonsense in the real world. In some cases, they completely missed the best structure.

3. The "Hallucinations":
One funny (but serious) failure happened with a compound called silver perchlorate ($AgClO_4$). The models kept trying to build structures with floating pairs of oxygen atoms ($O_2$) inside the solid. It's like the LEGO AI decided that two bricks glued together in mid-air was a valid part of the castle! The models just hadn't seen enough examples of how oxygen behaves in solids to know that this was a bad idea.

The "Surprise" Discoveries

Because the researchers let the models run so freely, the searches also turned up two unexpected candidate structures.

The "Better" Na2CN2: One model found a new way to pack sodium cyanamide that seemed more stable than the known version, but only when using one specific type of physics calculation. It turned out to be a fluke of that specific calculation method, not a real new material.
The "Hidden" MgB3C3: Another model found a new structure for a Magnesium-Boron-Carbon mix that was more stable than the previously known "superconductor" candidate. This suggests that if we can make this material, it might have even cooler properties than we thought.

The "Tricky" Cases: When Physics Gets Weird

The paper also tested the models on three "tricky" scenarios where the atoms behave strangely due to their electronic structure:

The Stretchy Zinc: In an ideal hexagonal packing, the spacing between atomic layers is fixed by geometry, but real zinc is stretched along the stacking direction well beyond that ideal. Most models failed to predict this stretch and treated zinc like an ideally packed metal. Only one model got it right.
The Shapeshifting Borides: Some metal-boron compounds can twist into different shapes depending on the metal used. The best models could predict these twists; the others just saw the "default" shape and missed the subtle changes.
The Off-Recipe Lithium: Lithium and Boron usually mix in a perfect ratio, but sometimes they mix in a weird, off-ratio way. The models surprisingly got this right, correctly predicting that the "weird" mix is actually the most stable one.

The Bottom Line

This paper is a massive "stress test" for the new generation of AI tools in materials science.

The Good News: We are getting very close. The best models are now good enough to act as a "first draft" for discovering new materials. They can do in minutes what used to take weeks of supercomputer time.

The Bad News: They aren't perfect yet. They can still get confused by weird chemistry or "hallucinate" impossible structures.

The Takeaway: You can't just download an "all-in-one" app and expect it to be 100% right every time. You still need a human expert (or a final check with the slow, precise physics method) to verify the results. However, these tools are powerful enough to narrow down the search from "finding a needle in a haystack" to "finding the needle in a small box."

In short: The AI is a brilliant apprentice, but it still needs a master builder to double-check its work before we start building the real thing.

1. Problem Statement

Machine Learning Interatomic Potentials (MLPs) have revolutionized materials simulation by offering DFT-level accuracy at a fraction of the computational cost. Universal MLPs (uMLPs), which are models trained on massive, diverse datasets, have shown promise in property prediction and local optimization, but their reliability in unconstrained global structure searches had not been systematically tested.

Global optimization is a demanding application because it explores vast regions of the potential energy surface (PES), often probing motifs and configurations absent from the training data. The core challenge is determining whether current uMLPs can:

Robustly identify the true ground state among competing phases without being misled by spurious local minima.
Accurately resolve fine energy differences arising from subtle electronic structure features (e.g., band topology, Peierls distortions).
Generalize across diverse chemistries and bonding types without system-specific retraining.
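The first challenge, being misled by spurious local minima, can be illustrated on a toy one-dimensional energy curve. The sketch below is not from the paper: the tilted double well `toy_energy` and the plain gradient-descent routine `relax` are illustrative assumptions standing in for a potential energy surface and a local structure relaxation.

```python
def toy_energy(x):
    """Hypothetical 1D 'PES': a double well, tilted so the left well is deeper."""
    return (x**2 - 1.0) ** 2 + 0.3 * x


def toy_gradient(x):
    """Analytic derivative of toy_energy."""
    return 4.0 * x * (x**2 - 1.0) + 0.3


def relax(x, step=0.05, n_steps=500):
    """Plain gradient descent, a stand-in for a local structure relaxation."""
    for _ in range(n_steps):
        x -= step * toy_gradient(x)
    return x


left = relax(-1.5)   # starts in the basin of the deeper (global) minimum
right = relax(1.5)   # starts in the basin of the shallower (local) minimum
print(f"start -1.5 -> x = {left:.3f}, E = {toy_energy(left):.3f}")
print(f"start +1.5 -> x = {right:.3f}, E = {toy_energy(right):.3f}")
```

Both runs end in a minimum, but only the one started on the left reaches the lower-energy well. A global search has to sample enough starting points to find that well, and the energy model has to rank the two wells in the right order.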

2. Methodology

The authors conducted a systematic benchmark of nine state-of-the-art uMLPs using an evolutionary algorithm framework.

Models Evaluated: Nine models spanning various architectures and training sets: M3GNet (MG), MACE (MC), SevenNet (SN), EquiformerV2 (EQ), MatterSim (MS), GRACE (GR), eSEN (EN), Orb-v3 (OR), and PET-MAD (PT).
Search Protocol:
- Algorithm: An evolutionary search (MAISE) was used with populations of 100 structures evolved over 100 generations.
- Surrogate Role: uMLPs acted as "out-of-the-box" surrogate models to rank and relax structures.
- Validation: Low-energy candidates identified by uMLPs were re-optimized using reference DFT methods (primarily PBE, with comparisons to PBEsol and r2SCAN).
- Metrics: Performance was assessed using Ranking RMSE (root mean square error of relative energies in low-energy pools), Energy Proximity (deviation of uMLP minima from DFT minima), and Structural Proximity (fingerprint similarity).
Test Systems:
- 12 Inorganic Compounds: Including standard systems (e.g., TiO2, Si3CaPt) and recently proposed ground states with complex motifs (Li3Sn, Pd5Sn3, MgB3C3).
- Three Challenging Cases:
  1. hcp-Zn: Testing the ability to reproduce an anomalous $c/a$ ratio driven by electronic band topology.
  2. MB4 (M = Cr, Mn, Fe): Testing the resolution of competing polymorphs involving symmetry breaking and Peierls distortions.
  3. LiBy ($y \approx 0.9$): Testing off-stoichiometric phases not present in major databases, relying on subtle charge transfer and orbital effects.
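The search protocol above can be sketched as a toy evolutionary loop. This is a minimal illustration and not the MAISE implementation: a "structure" here is just a list of numbers, `surrogate_energy` is a hypothetical stand-in for a uMLP energy call, and the selection, crossover, and mutation rules are assumptions chosen for brevity. Only the population size and generation count (100 each) come from the text.

```python
import random


def surrogate_energy(x):
    """Hypothetical stand-in for a uMLP energy: one tilted double well per coordinate."""
    return sum((xi**2 - 1.0) ** 2 + 0.3 * xi for xi in x)


def crossover(a, b, rng):
    """Build a child by taking each coordinate from one of the two parents."""
    return [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]


def mutate(x, rng, step=0.3):
    """Perturb every coordinate with Gaussian noise."""
    return [xi + rng.gauss(0.0, step) for xi in x]


def evolve(n_dim=4, pop_size=100, generations=100, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-2.0, 2.0) for _ in range(n_dim)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        pop.sort(key=surrogate_energy)
        best_history.append(surrogate_energy(pop[0]))
        parents = pop[: pop_size // 2]  # keep the lower-energy half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pop = parents + children
    pop.sort(key=surrogate_energy)
    return pop[0], surrogate_energy(pop[0]), best_history


best, best_energy, history = evolve()
print("lowest surrogate energy found:", round(best_energy, 4))
```

Because the lower-energy half of each generation is carried over unchanged, the best energy can never get worse from one generation to the next. In a real workflow each candidate would also be locally relaxed with the uMLP before ranking, which this toy version omits.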

3. Key Contributions

First Systematic Benchmark for Unconstrained Searches: Unlike previous studies focusing on local optimization or property prediction, this work specifically targets the "black box" capability of uMLPs in discovering new crystal structures without prior structural input.
Comprehensive Metric Suite: The authors introduced a rigorous protocol involving merged candidate pools and specific metrics (Ranking RMSE) that isolate the model's ability to rank competing low-energy states, removing biases from average energy shifts.
Discovery of New Phases: The benchmarking process itself led to the discovery of two potentially more stable phases than previously reported:
- tI10-Na2CN2: A less dense packing of CN2 units (though likely a PBE-specific artifact).
- oI28-MgB3C3: A 3D-connected BC framework that is more stable than the previously proposed layered honeycomb superconductor across all tested DFT functionals.
Performance Stratification: The study categorizes uMLPs not just by accuracy, but by their suitability for specific types of global exploration tasks.

4. Results

A. Global Search Success Rates

High Performers: eSEN (EN) and SevenNet (SN) demonstrated the highest success rates (92%), successfully locating ground states for 11 out of 12 compounds. EquiformerV2 (EQ) also performed well (75% success) but generated larger pools of low-symmetry candidates, suggesting it explores the PES more broadly.
Low Performers: M3GNet (MG) performed poorly (17% success), failing to identify ground states for most compounds and producing unphysical minima.
General Trend: Larger models with more expressive descriptors (e.g., EQ, EN) generally outperformed smaller, earlier architectures.
Failure Mode: The only systematic failure across all models was AgClO4, where the models failed to disfavor structures containing molecular $O_2$ units, which points to a lack of training data for specific oxygen bonding environments.
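The percentages above map onto whole numbers of compounds out of 12. The short check below shows the arithmetic; only "11 out of 12" is stated in the text, so the counts recovered for the 75% and 17% figures are inferences, and the function names are illustrative.

```python
N_COMPOUNDS = 12  # number of test compounds, from the text


def success_rate(n_found, n_total=N_COMPOUNDS):
    """Percentage of compounds whose ground state was located, rounded to an integer."""
    return round(100 * n_found / n_total)


def implied_count(percent, n_total=N_COMPOUNDS):
    """Whole number of successes most consistent with a reported percentage."""
    return round(percent / 100 * n_total)


for model, percent in [("eSEN / SevenNet", 92), ("EquiformerV2", 75), ("M3GNet", 17)]:
    print(model, "->", implied_count(percent), "of", N_COMPOUNDS)
```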

B. Quantitative Metrics

Ranking RMSE: The best models (EN, EQ) achieved ranking RMSEs between 5–7 meV/atom, significantly lower than the typical ~20 meV/atom error reported in general benchmarks (e.g., MatBench). This indicates that uMLPs are highly accurate in the relative energy landscape near the ground state, even if absolute energies vary.
Energy Proximity: eSEN showed the smallest deviations in energy and structural fingerprints upon re-optimization with DFT.
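A ranking error of this kind can be sketched as follows. The paper's exact definition is not reproduced here; this version assumes that the average offset between model and DFT energies over the pool is subtracted before taking the root mean square, so that a constant energy shift does not count as an error. The function name and the sample numbers are illustrative.

```python
import math


def ranking_rmse(e_model, e_dft):
    """RMSE of model-vs-DFT energies after removing the mean offset over the pool.

    Both inputs are per-structure energies in the same units (e.g. meV/atom),
    listed in the same structure order.
    """
    if len(e_model) != len(e_dft) or not e_dft:
        raise ValueError("need two non-empty energy lists of equal length")
    diffs = [m - d for m, d in zip(e_model, e_dft)]
    shift = sum(diffs) / len(diffs)  # average energy shift, not a ranking error
    return math.sqrt(sum((x - shift) ** 2 for x in diffs) / len(diffs))


# Made-up pool of three structures (meV/atom), for illustration only.
dft = [0.0, 10.0, 20.0]
model_shifted = [50.0, 60.0, 70.0]  # same ordering and spacing, constant offset
model_skewed = [5.0, 15.0, 31.0]    # third structure is over-penalized

print(ranking_rmse(model_shifted, dft))
print(ranking_rmse(model_skewed, dft))
```

A model that is off by the same amount for every structure scores zero here, which matches the idea that only relative energies decide which candidate is ranked as the ground state.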

C. Case Study Specifics

hcp-Zn (Electronic Anomaly): Most uMLPs failed to reproduce the shallow energy basin associated with the anomalous $c/a$ ratio (1.826). Only SevenNet (SN) closely followed the DFT energy profile. Others either favored the ideal packing or introduced discretization artifacts.
MB4 (Symmetry Breaking): eSEN and EquiformerV2 were the only models to correctly identify the distorted ground states (mP20 for Mn, oP10 for Cr/Fe) and quantify the stabilization energy within 10 meV/atom.
LiBy (Off-Stoichiometry): All uMLPs correctly placed the stability minimum in the Li-rich region ($y \approx 0.9$), despite this phase being underrepresented in training data. However, they underestimated the energy difference between $\alpha$ and $\beta$ phases and the curvature of the stability parabola.
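To put the zinc result in context, the ideal $c/a$ ratio for hexagonal close packing of hard spheres is $\sqrt{8/3} \approx 1.633$, a standard geometric result that is not quoted in the text. The snippet below compares the 1.826 value given above with that ideal; the function name is illustrative.

```python
import math

IDEAL_C_OVER_A = math.sqrt(8.0 / 3.0)  # ideal hcp ratio for close-packed hard spheres


def stretch_percent(c_over_a):
    """How far a given c/a ratio exceeds the ideal hcp value, in percent."""
    return (c_over_a / IDEAL_C_OVER_A - 1.0) * 100.0


zn_ratio = 1.826  # anomalous c/a value quoted in the text
print(f"ideal c/a = {IDEAL_C_OVER_A:.3f}")
print(f"Zn is stretched by {stretch_percent(zn_ratio):.1f}% along c")
```

A stretch of this size sits in a shallow energy basin, which is why small errors in a model's energy profile are enough to push the predicted minimum back toward the ideal ratio.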

5. Significance and Conclusion

Shift in Paradigm: The results suggest that system-specific potentials are no longer strictly necessary for initial crystal structure exploration. Pre-trained uMLPs can serve as robust starting points for discovering new thermodynamically stable materials.
Model Selection: The study highlights that eSEN currently offers the most consistent performance across diverse tasks, balancing global search success with the ability to capture subtle electronic effects. M3GNet, despite its popularity, is shown to be insufficient for unconstrained global optimization in complex systems.
Future Directions: While uMLPs are powerful, the study emphasizes the need for:
- Targeted Retraining: To fix specific failures (e.g., oxygen bonding in AgClO4).
- Hybrid Workflows: Using uMLPs for broad exploration followed by DFT re-optimization of a small candidate pool.
- Data Diversity: The inclusion of specific electronic structure features and off-stoichiometric phases in training sets is crucial for resolving fine energy differences.
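The hybrid workflow can be sketched as a simple filtering step between the uMLP search and the DFT stage. This is an assumed illustration rather than the authors' procedure: the function name, the 20 meV/atom window, the cap of 10 candidates, and the sample pool are all hypothetical.

```python
def select_for_dft(candidates, window_mev=20.0, max_candidates=10):
    """Pick low-energy uMLP candidates to pass on to DFT re-optimization.

    candidates: list of (label, uMLP energy in meV/atom).
    Returns (label, energy above the uMLP minimum) pairs, lowest first.
    """
    if not candidates:
        return []
    e_min = min(energy for _, energy in candidates)
    kept = [
        (label, energy - e_min)
        for label, energy in candidates
        if energy - e_min <= window_mev
    ]
    kept.sort(key=lambda item: item[1])
    return kept[:max_candidates]


# Made-up pool of four candidate structures, for illustration only.
pool = [("A", -100.0), ("B", -95.0), ("C", -70.0), ("D", -99.0)]
for label, rel_energy in select_for_dft(pool):
    print(label, rel_energy)
```

The window should be wider than the model's expected ranking error, so that the true ground state is not discarded before DFT sees it.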

In conclusion, this work establishes that the latest generation of uMLPs has reached a level of maturity where they can effectively accelerate the discovery of complex inorganic materials, provided that their limitations in specific electronic regimes are acknowledged and managed.