Systematic Fine-Tuning of MACE Interatomic Potentials… — Plain-Language Explanation

Original authors: Nima Karimitari, Jacob Clary, Derek Vigil-Fowler, Ravishankar Sundararaman, Gábor Csányi, Christopher Sutton

Published 2026-05-12

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a computer to predict how chemicals react on a catalyst (a material that speeds up reactions without being used up, like the catalytic converter in a car's exhaust). To do this, the computer needs a "map" of the energy landscape, showing where the hills (barriers to reaction) and valleys (stable states) are.

Traditionally, drawing this map requires incredibly slow and expensive supercomputer calculations (called Density Functional Theory, or DFT). Machine Learning Interatomic Potentials (MLIPs) are like a shortcut: they are AI models that learn to draw this map almost instantly, with accuracy close to that of the DFT calculations they were trained on.

This paper is a guide on how to train these AI models most effectively. The authors tested two main ways to teach the AI: starting from zero ("From-Scratch") and giving it a head start with a pre-trained "foundation" model ("Fine-Tuning").

Here is the breakdown of their findings using simple analogies:

1. The Two Training Strategies

Strategy A: From-Scratch (FS) – "The Blank Slate"
Imagine trying to teach a student to navigate a city by only showing them the main streets where people live (relaxed, stable structures).

The Problem: If you only show them the main streets, they get lost when they encounter a construction zone or a detour (high-energy, unstable states where bonds are breaking).
The Fix: The authors found that for these "blank slate" models, you must show them the "construction zones." By adding data from simulations that shake the atoms around (Molecular Dynamics) or force them to stretch until they almost break (Contour Exploration), the model learns the tricky parts of the map.
Result: Without these "chaos" examples, the model makes big mistakes. With them, the error drops by more than half.

Strategy B: Fine-Tuning (FT) – "The Expert Intern"
Imagine hiring a student who has already graduated from a top university with a degree in chemistry (a pre-trained "Foundation Model" called MACE-MH-1). They already know the general layout of the world.

The Advantage: You don't need to show them every single street or every construction zone. You just need to give them a specific handbook for the neighborhood you care about (e.g., metal catalysts or metal oxides).
The Result: These "interns" are much more robust. Even if you only give them a small amount of specific data, they perform better than the "blank slate" students, even on reactions they haven't seen before (out-of-distribution). They are less sensitive to how you collect the data.

2. The "Cross-Training" Surprise

One of the coolest findings is that these models can learn from one type of material and apply it to another.

The Analogy: It's like teaching a chef how to cook steak (metal catalysts) and then asking them to cook a vegetable dish (metal-oxide catalysts).
The Finding: The authors found that if you fine-tune the model on metal data, it surprisingly gets really good at predicting reactions on metal oxides, even if it never saw metal oxides in its specific training set. Conversely, training on metal oxides helped it predict reactions on metals.
Why? The model learned the fundamental "physics" of how atoms bond and break, which applies to both materials.

3. The "Super-Model" and the Big Screen

The authors combined all their best data to create one "Super-Model" (FT-All).

The Test: They used this model to screen a massive library of 90,781 adsorption configurations on binary alloys (mixtures of two metals) to see which ones might be good catalysts.
The Outcome: The model was highly accurate, with a mean absolute error of just 0.15 eV (a small margin of error in this field). It successfully predicted how chemicals would stick to surfaces it had never seen before, including complex, jagged surfaces (high Miller index surfaces) that are hard to model.

4. Why This Matters for Real Reactions

The paper tested these models on real-world, difficult chemical reactions:

CO2 Reduction: Turning carbon dioxide into useful fuels (like ethylene or ethanol).
Oxygen Evolution: A key step in splitting water to produce clean hydrogen fuel.
Propane Dehydrogenation: Turning propane into propylene, a building block for plastics.

In almost every case, the Fine-Tuned models outperformed the From-Scratch models. They were faster to train, required less data, and were more accurate at predicting the "energy barriers" (the height of the hill the reaction has to climb).

The Bottom Line

If you want to build a machine learning model for catalysis:

Don't start from scratch unless you have a massive, chaotic dataset full of broken bonds and high-energy states.
Start with a pre-trained foundation model and "fine-tune" it with a smaller, targeted dataset. It's like giving a smart intern a specific project rather than hiring a novice and teaching them everything from day one.
Don't worry too much about perfect data diversity if you are fine-tuning; the foundation model has already learned the general rules of the universe.

This work provides a roadmap for scientists to build better, faster, and more accurate AI tools to discover new catalysts for clean energy and sustainable chemistry.

Technical Summary: Systematic Fine-Tuning of MACE Interatomic Potentials for Catalysis

Problem Statement
Accurately predicting catalytic reaction pathways requires determining two key quantities: reaction energies ( $E_r$ ) and activation energy barriers ( $E_a$ ). While Density Functional Theory (DFT) is the standard for these calculations, it becomes computationally prohibitive for complex catalyst-adsorbate combinations involving diverse binding sites, defect states, and surface disorder. Machine-learned interatomic potentials (MLIPs) offer a solution by providing DFT-level accuracy at a fraction of the computational cost. However, the performance of MLIPs is heavily dependent on the construction of their training sets. Existing "from-scratch" (FS) training strategies often require extensive, diverse sampling (e.g., molecular dynamics, contour exploration) to capture bond-breaking events and high-energy configurations necessary for accurate barrier predictions. Conversely, the efficacy of fine-tuning (FT) large foundation models (such as MACE-MH-1) for specific catalytic applications remains under-explored, particularly regarding their sensitivity to training set diversity and their ability to generalize to out-of-distribution (OOD) reactions.
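To make the two quantities concrete, here is a minimal sketch of how $E_r$ and $E_a$ are commonly read off a set of energies along a reaction path (for example, images from a nudged elastic band calculation). The function name, the input format, and the example energies are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Tuple


def reaction_and_barrier(path_energies: Sequence[float]) -> Tuple[float, float]:
    """Return (E_r, E_a) in the same units as the input energies.

    path_energies: energies of the images along a reaction path, ordered
    from the initial state (first entry) to the final state (last entry).
    """
    if len(path_energies) < 2:
        raise ValueError("need at least an initial and a final state")
    e_initial = path_energies[0]
    e_final = path_energies[-1]
    e_r = e_final - e_initial             # reaction energy
    e_a = max(path_energies) - e_initial  # forward activation barrier
    return e_r, e_a


# Made-up energies in eV, for illustration only
e_r, e_a = reaction_and_barrier([0.0, 0.4, 0.9, 0.3, -0.2])
```

In this toy path the final state lies 0.2 eV below the initial state, and the highest image sits 0.9 eV above it. An MLIP that only reproduces the end points well can still get $E_a$ wrong, which is why the high-energy region of the path matters for training.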

Methodology
The authors systematically compared the performance of 9 MACE-based MLIPs using two distinct training strategies:

From-Scratch (FS) Training: Models were trained from random initialization on varying subsets of data, including:
- Relaxation trajectories (low-energy configurations).
- Molecular Dynamics (MD) configurations (thermal fluctuations).
- Contour Exploration (CE) configurations (high-energy, bond-breaking events generated by following potential energy contours rather than gradients).
Fine-Tuning (FT) Training: Models started from the pre-trained MACE-MH-1 foundation model. A multi-head replay strategy was employed to prevent catastrophic forgetting, utilizing a "replay" head trained on 30,000 configurations from the OMAT dataset and a second head trained on specific catalytic datasets (metallic, metal-oxide, or mixed).
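The idea of augmenting a relaxation-only training set with perturbed configurations can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function name, the tuple-based frame representation, and the reading of the perturbed share as a fraction of the final set are all assumptions made for the example.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_training_set(relaxation: Sequence[T], perturbed: Sequence[T],
                       perturbed_fraction: float, seed: int = 0) -> List[T]:
    """Mix all relaxation frames with a random sample of perturbed frames.

    perturbed_fraction is the target share of perturbed (MD or CE) frames
    in the final set and must satisfy 0 <= perturbed_fraction < 1.
    """
    if not 0.0 <= perturbed_fraction < 1.0:
        raise ValueError("perturbed_fraction must be in [0, 1)")
    # Solve n_p / (n_r + n_p) = f for n_p, then cap at what is available.
    n_target = round(perturbed_fraction * len(relaxation) / (1.0 - perturbed_fraction))
    n_perturbed = min(n_target, len(perturbed))
    rng = random.Random(seed)
    mixed = list(relaxation) + rng.sample(list(perturbed), n_perturbed)
    rng.shuffle(mixed)
    return mixed


# Hypothetical placeholder frames; real frames would be atomic structures
relaxation_frames = [("relaxation", i) for i in range(900)]
perturbed_frames = [("md_or_ce", i) for i in range(500)]
training_set = build_training_set(relaxation_frames, perturbed_frames,
                                  perturbed_fraction=0.10)
```

With 900 relaxation frames and a 10% target, the sketch draws 100 perturbed frames for a total of 1,000. The fixed seed keeps the sample reproducible, which helps when comparing models trained on different mixtures.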

The models were evaluated on a diverse test set of 141 chemical reactions, including:

Metallic Catalysts: $CO_2$ reduction to $C_2$ and $C_3$ products, propane dehydrogenation, and hydrogen intercalation on palladium.
Metal-Oxide Catalysts: Oxygen evolution reaction (OER) on iridium oxide polymorphs.
Out-of-Distribution (OOD) Tests: Applying models trained on metallic catalysts to metal-oxide reactions and vice versa.
High-Throughput Screening: Predicting adsorption energies for 90,781 configurations on binary transition metal alloys with unseen Miller indices.
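For readers unfamiliar with the screening target, an adsorption energy is commonly defined as the energy of the surface with the adsorbate minus the energies of the clean surface and of a reference for the adsorbate. The sketch below uses that common convention; the exact reference states used in the paper are not given here, and the function name and numbers are assumptions for illustration.

```python
def adsorption_energy(e_slab_with_adsorbate: float,
                      e_clean_slab: float,
                      e_adsorbate_reference: float) -> float:
    """E_ads = E(slab + adsorbate) - E(clean slab) - E(adsorbate reference).

    All inputs are total energies in the same units (typically eV).
    A more negative result means stronger binding under this convention.
    """
    return e_slab_with_adsorbate - e_clean_slab - e_adsorbate_reference


# Made-up total energies in eV, for illustration only
e_ads = adsorption_energy(-250.0, -235.0, -14.5)
```

In a screening setting, the first two energies would come from the MLIP rather than from DFT, which is what makes evaluating tens of thousands of configurations affordable.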

Key Contributions and Results

Training Set Requirements for FS Models: The study demonstrates that FS models trained solely on relaxation trajectories perform poorly for barrier predictions. Incorporating 5% to 10% of perturbed, high-energy configurations from MD or CE significantly reduces errors. Specifically, an FS model augmented with CE configurations (FS-All) achieved an $E_a$ Mean Absolute Error (MAE) of 0.175 eV for metallic catalysts, a substantial improvement over models trained only on relaxation data.
Superiority of Fine-Tuning: FT models demonstrated robustness with smaller datasets and less sensitivity to specific sampling techniques.
- An FT model fine-tuned on metallic catalysts (FT-All) achieved an $E_r$ MAE of 0.141 eV for the $CO_2$ reduction $CHCOH^*$ pathway, outperforming the best FS model (0.251 eV).
- For OER on iridium oxides, the FT-All model achieved an MAE of 0.278 eV, significantly lower than the best FS model (0.384 eV) and the base MACE-MH-1 model (0.384 eV).
Cross-System Generalization: FT models exhibited remarkable transferability. A model fine-tuned exclusively on metallic catalysts (without metal-oxide data) predicted OER steps on metal-oxide polymorphs with an MAE of 0.305 eV. Conversely, models fine-tuned on metal oxides performed well on metallic catalyst nudged elastic band (NEB) calculations. This suggests that the pre-trained foundation model already encodes enough chemical knowledge that fine-tuning on a specific subset, even one without explicit bond-breaking events in the target domain, yields robust performance.
High-Throughput Screening: The largest FT model (trained on 49,860 configurations) was used to screen 90,781 adsorption energies on bimetallic alloys. It achieved an overall MAE of 0.15 eV, even for adsorbates on unseen high-index surfaces (e.g., (532) facets), demonstrating its utility for rapid catalyst discovery.
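The MAE values quoted above are averages of absolute differences between model predictions and DFT reference values. A minimal sketch of the metric, with made-up numbers and a hypothetical function name:

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float],
                        reference: Sequence[float]) -> float:
    """Average of |predicted - reference| over paired values."""
    if len(predicted) != len(reference):
        raise ValueError("inputs must have the same length")
    if len(predicted) == 0:
        raise ValueError("inputs must not be empty")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Made-up barriers in eV, for illustration only
mlip_barriers = [0.75, 1.25, 0.50, 1.00]
dft_barriers = [0.50, 1.00, 0.75, 1.00]
mae = mean_absolute_error(mlip_barriers, dft_barriers)
```

Because the errors are taken in absolute value, overestimates and underestimates do not cancel: three barriers off by 0.25 eV and one exact match give an MAE of 0.1875 eV.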

Significance and Claims
The authors claim that this work identifies the necessary training set configurations for both FS and FT approaches, highlighting a shift in strategy for catalytic MLIP development. The primary significance lies in demonstrating that fine-tuning large foundation models is a more efficient and accurate pathway than training from scratch for catalytic applications.

Key claims include:

Efficiency: FT models can achieve state-of-the-art accuracy with significantly smaller datasets compared to FS models, reducing the computational cost of generating training data.
Generalizability: FT models trained on one class of catalysts (e.g., metals) can effectively predict reactions on another class (e.g., metal-oxides), provided the foundation model has been pre-trained on a diverse dataset.
Robustness: Unlike FS models, which require explicit inclusion of bond-breaking events (via CE or MD) to predict barriers accurately, FT models leverage the pre-existing chemical knowledge of the foundation model to handle OOD reactions and transition state searches without specialized sampling techniques.

The paper concludes that while FS models require diverse, high-energy configurations to reduce errors, FT models offer a more generalizable and accurate solution for screening catalytic reactions across metallic and metal-oxide systems.
