High-quality, high-information datasets for universal atomistic machine learning

Imagine you are trying to teach a robot chef how to cook any dish in the universe, from a simple salad to a complex, multi-layered cake. To do this, you need a cookbook. But here's the problem: most existing cookbooks have at least one of three flaws:

Incomplete: They only have recipes for Italian food (missing Asian, African, etc.).
Inconsistent: One recipe says "bake at 350°F," while another says "bake at 350 degrees" (but uses a different scale), and a third uses a broken oven.
Boring: They only show you perfect, finished dishes, not what happens when you burn the toast or drop the cake on the floor.

If you train your robot on a bad cookbook, it will fail when you ask it to cook something new or when things go wrong in the kitchen.

This paper introduces MAD-1.5, which is essentially a super-charged, universal cookbook for atoms.

The Problem: The "Bad Cookbooks" of Science

Scientists use computers to simulate how atoms behave, for tasks such as developing building materials, designing drugs, or creating new batteries. To do this both accurately and quickly, they increasingly rely on machine learning (AI). But an AI model is only as good as the data it learns from.

Previously, the data available was messy:

It focused on specific types of materials (like only metals or only water).
The calculations used to generate the data were done with different rules, leading to contradictions.
It lacked "chaos." It didn't show what happens when atoms are squished together, pulled apart, or heated to extreme temperatures.

The Solution: The "MAD-1.5" Cookbook

The authors created a new dataset called MAD-1.5 (Massive Atomic Diversity 1.5). Think of it as a massive, meticulously organized library containing more than 216,000 different atomic "recipes."

Here is what makes it special:

1. It Covers Everyone (The Periodic Table)
While old cookbooks might only cover 85 ingredients, this one covers 102 elements from the Periodic Table. It includes everything from common stuff like Carbon and Iron to rare, heavy elements like Uranium. The cutoff is simple: an element is included if at least one of its isotopes (versions of the element with different numbers of neutrons) lasts longer than a day.

2. It Uses One Strict Rulebook (Consistency)
In the past, different scientists used different "ovens" (mathematical formulas) to calculate how atoms interact. This created confusion.
For MAD-1.5, the team used one single, high-precision oven (called the r2SCAN functional) for every single calculation. This ensures that the "flavor" of the data is consistent from start to finish.

3. It Includes the "Disasters" (Robustness)
Most datasets only show atoms in their happy, relaxed state. MAD-1.5 deliberately includes:

Dimers and Trimers: Pairs and trios of atoms that rarely exist in nature, to teach the AI how atoms behave when they are just starting to bond.
Chaos: Structures that are stretched, squashed, or heated to the point of melting. This teaches the AI how to handle "emergency situations" without crashing.

4. The "Quality Control" Filter
Even with a strict rulebook, sometimes the computer oven glitches. The authors used a smart "quality control" system (an AI that checks its own work) to find and throw out any "recipes" that were calculated incorrectly. They even published the "rejected recipes" so other scientists can study why they failed.

The Result: The "Universal Chef" (PET-MAD-1.5)

Using this carefully cleaned dataset, the authors trained a new AI model called PET-MAD-1.5.

Think of this AI as a universal chef that can now:

Predict how a new material will behave before it's even built.
Simulate what happens to a material at 3,000 K (about 2,700°C, hotter than lava).
Handle a "Mendeleev Cluster"—a giant ball of atoms containing one of every single element in the periodic table.

The Ultimate Stress Test:
To prove their AI was tough, they simulated a "Mendeleev Cluster." Imagine a ball made of one atom of Gold, one of Oxygen, one of Helium, one of Uranium, etc., all mixed together. They heated it up and shook it around.

Old AI models: Would likely explode or give nonsense answers because they've never seen such a chaotic mix.
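PET-MAD-1.5: Stayed calm. The simulation ran without crashing: the noble gases (like Helium) drifted away while the remaining atoms held together as a roughly spherical particle. Spot checks against full quantum-mechanical calculations showed that the model's forces stayed reasonably accurate even in this extreme mix.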

Why Should You Care?

This isn't just about fancy science; it's about accelerating discovery.

New Batteries: We can screen thousands of candidate battery materials far faster than with conventional quantum-mechanical calculations, to find the one that charges faster and lasts longer.
New Drugs: We can understand how complex molecules interact with the human body more accurately.
Clean Energy: We can design better materials for capturing carbon or splitting water for hydrogen fuel.

By providing a clean, consistent, and massive "textbook" for atoms, the authors have given the scientific community a powerful new tool to solve some of the world's hardest engineering problems. They didn't just build a bigger dataset; they built a better foundation for the future of materials science.

Here is a detailed technical summary of the paper "High-quality, high-information datasets for universal atomistic machine learning."

1. Problem Statement

The practical utility of machine-learned interatomic potentials (MLIPs) for atomistic simulations is often limited by the quality, consistency, and information content of their training data. Existing widely used electronic-structure databases suffer from three primary deficiencies:

Goal Mismatch: Many are assembled for materials screening rather than robust force-field learning, leading to a dominance of near-equilibrium structures that fail to constrain models in distorted, high-force, or close-contact regimes.
Inconsistency: Datasets assembled over time or from multiple sources often employ inconsistent Density Functional Theory (DFT) settings (e.g., varying functionals, numerical thresholds, or magnetic treatments), introducing subtle but significant noise.
Limited Scope: Many datasets are restricted to specific classes of compounds (e.g., only molecules or only bulk crystals) or lack coverage of the full periodic table, hindering the development of truly universal models.

2. Methodology

A. Dataset Construction: MAD-1.5

The authors introduce MAD-1.5, a highly curated dataset designed to extend the previous "Massive Atomic Diversity" (MAD-1) concept.

Scope: The dataset contains 216,803 atomic structures spanning 102 chemical elements (every element that has at least one isotope with a half-life > 1 day).
Composition: It is organized into 14 subsets, combining inherited structures from MAD-1 with new, targeted additions:
- Inherited: Bulk crystals, rattled configurations, surfaces, clusters, 2D materials, and molecular fragments.
- New Enrichment:
  - Monomers, Dimers, and Trimers: Explicitly sample 1-body, 2-body, and 3-body interactions across all element pairs to anchor fundamental interatomic forces and low-coordination environments.
  - MC3D-extended & MC3D-random-extended: Expanded crystal structures including previously missing lanthanides and actinides.
  - Binary-random: Substitutional orderings on BCC/FCC lattices to sample diverse chemical pairings.
Electronic Structure Standardization:
- Code: All calculations use FHI-aims (all-electron, numeric atom-centered orbitals).
- Functional: The r2SCAN meta-GGA functional is used uniformly. This offers a superior balance of accuracy (improving formation enthalpies and hydrogen bonds over GGAs) and computational cost compared to the PBEsol used in MAD-1.
- Convergence: Strict thresholds are enforced ($10^{-6}$ eV for energy, $10^{-4}$ eV/Å for forces) with an 8 Å$^{-1}$ k-point grid and Gaussian smearing.
- Note: Calculations are non-spin-polarized to ensure internal consistency, acknowledging this as a trade-off for universality in the absence of a universally convergent spin-polarized framework.
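
To give a feel for the dimer subset described under "New Enrichment", here is a minimal Python sketch of how one might enumerate every unordered element pair and lay out a distance scan for each. The function names, the three-element stand-in list, and the separations are hypothetical; the summary does not state how the authors actually sampled interatomic distances.

```python
from itertools import combinations_with_replacement


def element_pairs(symbols):
    """Return all unordered element pairs, including homonuclear ones (e.g. H-H)."""
    return list(combinations_with_replacement(symbols, 2))


def dimer_geometries(pair, distances):
    """Place the two atoms of `pair` on the x-axis at each separation (in Å)."""
    a, b = pair
    return [
        {"symbols": (a, b), "positions": [(0.0, 0.0, 0.0), (d, 0.0, 0.0)]}
        for d in distances
    ]


# Hypothetical stand-in for the full list of 102 elements.
symbols = ["H", "C", "O"]
pairs = element_pairs(symbols)  # n * (n + 1) / 2 pairs for n elements

# Hypothetical separations; the real sampling scheme is not given here.
scan = dimer_geometries(pairs[1], [0.8, 1.0, 1.2])
```

For n elements this enumeration yields n(n + 1)/2 pairs, so 102 elements give 5,253 distinct dimers before any distance sampling, which is why the few-body subsets stay small relative to the bulk structures.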

B. Outlier Detection and Cleaning

To ensure data reliability, a two-step cleaning procedure was applied:

Heuristic Filter: Removal of structures with force magnitudes > 100 eV/Å to prevent overfitting to unphysical repulsive regimes.
Uncertainty-Based Filter (LLPR): A preliminary ML potential was trained on the heuristically cleaned data. Using the Last-Layer Prediction Rigidity (LLPR) method, structures where the actual prediction error exceeded 3 times the predicted uncertainty were identified and removed. This eliminated 8,244 structures likely suffering from DFT convergence issues or local minima.
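
The two filters above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the per-structure errors and uncertainties are assumed to be precomputed arrays (the LLPR uncertainty estimate itself is not implemented here), and the summary does not say whether the 3x criterion is applied to energies, forces, or both.

```python
import numpy as np


def heuristic_force_filter(forces_per_structure, max_force=100.0):
    """Keep structures whose largest per-atom force norm is <= max_force (eV/Å)."""
    keep = []
    for forces in forces_per_structure:
        norms = np.linalg.norm(np.asarray(forces, dtype=float), axis=1)
        keep.append(bool(norms.max() <= max_force))
    return np.array(keep)


def uncertainty_filter(abs_errors, predicted_sigma, factor=3.0):
    """Keep structures whose actual error is within factor * predicted uncertainty."""
    abs_errors = np.asarray(abs_errors, dtype=float)
    predicted_sigma = np.asarray(predicted_sigma, dtype=float)
    return abs_errors <= factor * predicted_sigma


# Toy data: two structures with (n_atoms, 3) force arrays in eV/Å.
forces = [
    [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    [[150.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
force_mask = heuristic_force_filter(forces)

# Toy data: actual errors and predicted uncertainties for three structures.
sigma_mask = uncertainty_filter([0.01, 0.50, 0.02], [0.01, 0.10, 0.01])
```

Structures where a mask is `False` would be set aside rather than deleted, which matches the summary's note that the rejected cases were published separately.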

C. Model Architecture and Training

The authors trained PET-MAD-1.5, a general-purpose MLIP based on the Point Edge Transformer (PET) architecture (a rotationally unconstrained Graph Neural Network).

Strategy: The models were initialized from weights pre-trained on the massive OMat24 dataset (PBE level) and then fine-tuned on the high-quality MAD-1.5 data (r2SCAN level).
Training Targets: Atomization energies, atomic forces, and stresses.
Architecture Variants: Two models were released:
- XS: ~4.5M parameters (faster inference).
- S: ~25.9M parameters (higher accuracy).
Loss Function: Weighted sum of RMS errors for energy, forces, and stresses, including non-conservative heads for accelerated inference.
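
As a rough illustration of a "weighted sum of RMS errors", the NumPy sketch below combines per-target root-mean-square errors with scalar weights. The weight values and dictionary keys are hypothetical, and the non-conservative heads mentioned above are not modelled; the actual loss used to train PET-MAD-1.5 may differ in its normalization and details.

```python
import numpy as np


def rmse(pred, target):
    """Root-mean-square error over all entries of two equally shaped arrays."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def weighted_rmse_loss(pred, target, weights):
    """Sum of per-target RMSEs, each scaled by its (hypothetical) weight."""
    return sum(w * rmse(pred[key], target[key]) for key, w in weights.items())


# Toy predictions and references for one small batch.
pred = {"energy": [1.0, 2.0], "forces": [[0.0, 0.0, 0.0]], "stress": [0.0]}
target = {"energy": [1.0, 4.0], "forces": [[3.0, 0.0, 0.0]], "stress": [0.0]}

# Hypothetical weights; the values used by the authors are not given here.
weights = {"energy": 1.0, "forces": 2.0, "stress": 0.5}
loss = weighted_rmse_loss(pred, target, weights)
```

The weights set the relative importance of each target, so raising the force weight pushes the fit toward better dynamics at some cost in energy accuracy.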

3. Key Contributions

MAD-1.5 Dataset: A compact (~200k structures) but chemically exhaustive dataset covering 102 elements with internal consistency achieved via a single, standardized r2SCAN workflow.
Targeted Enrichment: The inclusion of explicit monomers, dimers, and trimers ensures the model correctly captures fundamental few-body interactions and the zero-density limit, which are often under-represented in bulk-only datasets.
Robust Outlier Removal: The application of LLPR-based uncertainty quantification to filter DFT convergence failures, providing a "clean" dataset and a separate set of "hard" cases for stress-testing DFT implementations.
PET-MAD-1.5 Models: Demonstration that fine-tuning a foundation model on a small, high-quality, consistent dataset yields superior performance compared to training on massive but noisy datasets.

4. Results and Benchmarks

A. In-Domain Performance

On the MAD-1.5 test set, the PET-MAD-1.5-S model achieved:

Mean Absolute Error (MAE) in Forces: 37 meV/Å (overall), with errors generally below 70 meV/Å for "reasonable" configurations.
MAE in Energy: 11.09 meV/atom.
The XS model achieved comparable accuracy to the original PET-MAD-1 (trained on PBEsol) but with a much broader chemical scope and faster inference.
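
For readers who want to reproduce numbers of this kind on their own data, here is a minimal sketch of the two metrics. It assumes energies in eV and forces in eV/Å, normalizes energy errors by atom count, and averages force errors over individual Cartesian components; that last convention is an assumption, since the summary does not say whether the reported force MAE is per component or per force vector.

```python
import numpy as np


def energy_mae_per_atom(e_pred, e_ref, n_atoms):
    """MAE of total energies normalized by atom count, in meV/atom (inputs in eV)."""
    e_pred = np.asarray(e_pred, dtype=float)
    e_ref = np.asarray(e_ref, dtype=float)
    n_atoms = np.asarray(n_atoms, dtype=float)
    return float(np.mean(np.abs(e_pred - e_ref) / n_atoms) * 1000.0)


def force_mae(f_pred, f_ref):
    """MAE over all Cartesian force components, in meV/Å (inputs in eV/Å)."""
    diffs = [
        np.abs(np.asarray(p, dtype=float) - np.asarray(r, dtype=float)).ravel()
        for p, r in zip(f_pred, f_ref)
    ]
    return float(np.mean(np.concatenate(diffs)) * 1000.0)


# Toy example: two structures with 2 and 4 atoms.
e_mae = energy_mae_per_atom([-10.0, -20.0], [-10.02, -19.96], [2, 4])

# Toy example: one single-atom structure.
f_mae = force_mae([[[0.1, 0.0, 0.0]]], [[[0.0, 0.0, 0.0]]])
```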

B. Out-of-Domain (OOD) Generalization

The models were evaluated on MADBench, a collection of diverse datasets such as MatBench, Alexandria, OC2020, SPICE, and MD22:

The models maintained high accuracy across different material classes and DFT settings (when targets were re-computed with r2SCAN).
Force errors remained below 70 meV/Å on most OOD subsets, demonstrating exceptional transferability.
The models significantly outperformed the original PET-MAD-1 (v1.0.2) and other state-of-the-art models trained on PBE or PBEsol data.

C. Stability Stress Test: "Mendeleev Clusters"

A rigorous stability test involved simulating a nanoparticle containing one atom of each of the 102 elements (a "Mendeleev cluster") via Replica Exchange Molecular Dynamics (REMD) from 300 K to 3000 K.

Result: The simulation remained stable over a total trajectory of 1.6 ns.
Physical Behavior: The cluster formed a quasi-spherical particle, expelling noble gases at low temperatures and exhibiting surface irregularities at high temperatures.
Accuracy: Comparison with single-point r2SCAN calculations on final structures yielded a force MAE of ~150 meV/Å. This is well above the in-domain error, but it indicates that the model stays physically sensible in extreme, high-energy, and multi-element environments where first-generation MLIPs typically fail.
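
To make the replica-exchange idea concrete, the sketch below builds a temperature ladder between 300 K and 3000 K and evaluates the standard Metropolis swap criterion, $P = \min\{1, \exp[(\beta_i - \beta_j)(E_i - E_j)]\}$ with $\beta = 1/(k_B T)$. The geometric spacing and the choice of 16 replicas are assumptions made for illustration; the summary does not give the ladder the authors used.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def geometric_ladder(t_min, t_max, n_replicas):
    """Temperatures spaced by a constant ratio between t_min and t_max (in K)."""
    ratio = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return [t_min * ratio**i for i in range(n_replicas)]


def swap_probability(e_i, e_j, t_i, t_j):
    """Metropolis acceptance for exchanging configurations between two replicas.

    e_i, e_j are potential energies in eV; t_i, t_j are temperatures in K.
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    # Branch instead of calling exp on a large positive number.
    return 1.0 if delta >= 0.0 else math.exp(delta)


# Hypothetical replica count; only the 300 K and 3000 K endpoints come from the text.
ladder = geometric_ladder(300.0, 3000.0, 16)

# A cold replica holding the higher-energy configuration always swaps.
p_always = swap_probability(-5.0, -6.0, 300.0, 600.0)
# The reverse situation is accepted only rarely.
p_rare = swap_probability(-6.0, -5.0, 300.0, 600.0)
```

Swaps let configurations trapped at low temperature escape through the hot replicas, which is what makes a test like this a demanding probe of a potential's stability across the whole temperature range.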

5. Significance

Universal Applicability: MAD-1.5 and PET-MAD-1.5 represent a significant step toward a truly universal interatomic potential capable of handling the entire periodic table at a high level of theory (r2SCAN).
Quality over Quantity: The work demonstrates that a smaller, carefully curated, and internally consistent dataset yields higher accuracy and stability than massive, heterogeneous datasets.
Benchmarking Standard: The dataset provides a new standard for testing the robustness of MLIPs in challenging regimes (high temperature, low coordination, extreme chemical diversity).
Open Science: Both the dataset (Materials Cloud) and the models (GitHub) are publicly released, facilitating reproducibility and further development in the field of atomistic machine learning.
