Fine-tuning MLIP foundation models: strategies for… — Plain-Language Explanation

Original authors: Tamás Lajos Tompa, Eszter Varga-Umbrich, Ilyes Batatia, Alin M. Elena, Noam Bernstein, Gábor Csányi

Published 2026-06-12

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a master chef who has spent years learning to cook perfect meals using only inorganic ingredients like rocks, metals, and salts. This chef is a "Foundation Model." Now, you want this chef to cook a specific new dish, like a delicate organic soup or a biological stew, using a very small amount of new recipes.

The big question is: How do you teach this chef the new dish without making them forget how to cook the old ones, or without ruining their existing skills?

This paper is a massive kitchen experiment testing seven different ways to "fine-tune" (retrain) this master chef. The researchers found that the method of teaching matters less than three critical "pre-cooking" steps: choosing the right chef, setting the right baseline, and tuning the heat.

Here is the breakdown of their findings in simple terms:

1. The Three "Pre-Flight" Checks (The Most Important Part)

Before you even start teaching the new recipe, you must get three things right. If you mess these up, no teaching method will save you.

Pick the Right Chef (Foundation Model Quality):
- The Analogy: You wouldn't hire a chef who only knows how to boil water to teach you how to bake a soufflé.
- The Finding: The quality of the original model matters more than the fine-tuning strategy. A model trained on a huge, diverse dataset of inorganic materials (such as OMat24) is much better at learning new, weird chemistry than an older, smaller model. Even with the same teaching method, a better foundation model consistently produced a better final dish.
Set the Zero Point (Atomic Reference Energy, $E_0$):
- The Analogy: Imagine measuring the height of a building. If you start measuring from the basement instead of the ground floor, your numbers will be wrong, and the building might look like it's floating or buried. In chemistry, you need to subtract the "weight" of the individual atoms so the model only learns about how they interact.
- The Finding: The researchers found that using a smart, "model-aware" way to set this zero point is crucial. If you use a lazy, average guess, the model becomes unstable. It might look good on paper (low error scores) but will fall apart when you try to simulate real-world physics (like a building collapsing in a wind tunnel test).
Turn Down the Heat (Hyperparameters):
- The Analogy: When learning a new skill, you don't want to move so fast that you trip, but you don't want to move so slow that you never finish.
- The Finding: Different teaching methods need different "learning rates." For example, LoRA (which only changes a tiny part of the model) can handle a higher learning rate, while Multihead Replay (which teaches two things at once) needs a much slower, gentler pace.
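For readers who want the zero-point check in symbols, the bookkeeping can be written as follows. The notation is an assumption made here for illustration and is not taken from the paper: $E_{\text{total}}$ is the energy of a structure with $N$ atoms, $E_0^{(Z_i)}$ is the reference energy of the isolated atom of element $Z_i$, and $E_{\text{int}}$ is the interaction part that the model learns.

$$
E_{\text{total}} = \sum_{i=1}^{N} E_0^{(Z_i)} + E_{\text{int}}
\qquad\Longrightarrow\qquad
E_{\text{int}} = E_{\text{total}} - \sum_{i=1}^{N} E_0^{(Z_i)}
$$

For a single water molecule this reads $E_{\text{int}} = E_{\text{total}} - 2E_0^{(\mathrm{H})} - E_0^{(\mathrm{O})}$. If the $E_0$ values are off by a fixed amount per atom, that error lands in $E_{\text{int}}$ and grows with the number of atoms, which is why the zero point has to match the data being fitted.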

2. The Seven Teaching Strategies

Once the three checks above are passed, the researchers tested seven ways to teach the new recipe:

Naive Fine-Tuning: "Just keep cooking." You take the whole chef and keep training them on the new data.
- Result: Great for learning one specific dish perfectly. But if you try to use this chef for a different type of food later, they might have forgotten their old skills (a problem called "catastrophic forgetting").
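Layer Freezing: "Don't touch the basics." You lock the chef's knowledge of basic knife skills and only let them learn the new sauce. The paper tests two variants that lock different amounts of the model, which is why the six names in this list add up to seven strategies.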
- Result: Good, but sometimes too rigid. It limits how well the chef can adapt to the new ingredients.
LoRA (Low-Rank Adaptation): "Add a cheat sheet." Instead of rewriting the whole cookbook, you add a small, efficient note-pad to the chef's apron that only covers the new rules.
- Result: Very efficient and accurate for specific tasks, similar to Naive tuning.
Multihead Replay: "The Dual-Head Chef." You give the chef two hats. One hat is for the new dish, and the other hat is for the old, familiar dishes. They practice both at the same time.
- Result: This is the winner for safety. It's the only method that consistently prevents the chef from forgetting their old skills. It keeps the chef good at the new dish and the old ones.
Pseudolabel Replay: "The Synthetic Chef." Instead of using real old recipes, you use the chef's own predictions of old recipes to practice.
- Result: Works well and is flexible because you don't need the original old data, just the chef's memory.
Replay + LoRA: Combining the cheat sheet with the dual hats.
- Result: Good, but the "Dual Head" alone was often enough.

3. The Big Takeaways

Don't Reinvent the Wheel: If you need a model for a specific, narrow task (like just simulating salt water), Naive Fine-Tuning is the fastest and easiest way to get a great result.
Don't Forget the Past: If you need a model that can handle weird, new situations (like a new type of battery or a complex biological molecule) without forgetting its original training, you must use Multihead Replay. It's the only strategy that kept the model robust and safe from "forgetting."
Quality Over Tricks: The paper emphasizes that spending time picking a high-quality foundation model and setting the energy references correctly is more important than choosing the perfect fine-tuning algorithm. If the foundation is weak or the math is set up wrong, the best teaching strategy in the world won't help.

In short: To get the best AI for chemistry, start with a smart foundation, set your math rules correctly, and if you want the AI to be versatile and not forgetful, teach it using the "Dual Head" method (Multihead Replay).

Technical Summary: Fine-tuning MLIP Foundation Models

Problem Statement
Machine-learned interatomic potential (MLIP) foundation models have demonstrated the ability to transfer across diverse chemical systems, offering a workflow that avoids the resource-intensive process of training task-specific potentials from scratch. However, the community lacks systematic guidance on how and when to fine-tune these models. Early reports suggested that naive fine-tuning often leads to "catastrophic forgetting," prompting the adoption of constrained techniques (e.g., layer freezing, Low-Rank Adaptation) originally developed for large language models. This paper investigates whether these constraints are necessary or if early failures were due to other factors, such as weaker foundation models, improper atomic reference energy ($E_0$) initialisation, and unstable training procedures. The study aims to characterize the major factors shaping fine-tuning outcomes, specifically target-task accuracy and out-of-distribution (OOD) robustness.

Methodology
The authors evaluate seven distinct fine-tuning strategies across five chemically diverse benchmarks, three generations of foundation models, and training sets spanning five orders of magnitude in size.

Fine-tuning Strategies Evaluated:
1. Naive: Full parameter updates via continued gradient descent.
2. Layer Freezing (two variants, counted separately): (a) freezing the embedding and message-passing layers while training only the readouts; (b) freezing the embedding and the first message-passing layer.
3. Low-Rank Adaptation (LoRA): Injecting trainable low-rank decompositions into both scalar and equivariant linear layers while freezing pretrained weights.
4. Multihead Replay: Simultaneous optimization on target data and a replay dataset (drawn from the pretraining data or pseudolabelled) using separate readout heads.
5. Pseudolabel Replay: A variant of multihead replay where replay labels are generated by the foundation model itself, decoupling the replay source from the original pretraining corpus.
6. Replay + LoRA: Combining multihead replay with LoRA.
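To make the LoRA entry in the list above concrete, here is a minimal NumPy sketch of a low-rank update on a single scalar linear layer. It is an illustration under assumptions, not the MACE implementation: the names `LoRALinear`, `rank`, and `alpha` are hypothetical, the `alpha / rank` scaling follows the common LoRA convention, and the equivariant linear layers mentioned above would need extra handling that is not shown.

```python
import numpy as np


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank correction B @ A (hypothetical sketch)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # pretrained weights, kept frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, d_in); the second term is the low-rank update.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def n_trainable(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(3, 64))

# With B initialised to zero, the adapted layer reproduces the pretrained one.
same_at_init = np.allclose(layer(x), x @ W.T)
# Fraction of parameters that would receive gradient updates.
trainable_fraction = layer.n_trainable() / W.size
```

Because `B` starts at zero, fine-tuning begins exactly at the pretrained model, and only the two small factors are updated afterwards.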
Benchmarks: The study spans systems with increasing departure from the OMat24 pretraining domain (periodic inorganic bulk):
- Lithium argyrodite electrolytes (inorganic periodic solid).
- Aqueous NaCl (ionic solution).
- Ice polymorphs (molecular solid).
- SN2 reactions (gas-phase reactive chemistry).
- SPICE biomolecules (organic/biomolecular conformers).
Technical Implementations: The authors implemented three new capabilities in the MACE codebase:
- LoRA adapted for equivariant message-passing architectures (covering both scalar and equivariant linear layers).
- Pseudolabelled replay to decouple replay data sources.
- Model-aware atomic reference energy ($E_0$) reestimation to align pretrained baselines with target data.
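One plausible reading of "model-aware" $E_0$ reestimation is a least-squares fit of per-element energy shifts to the gap between the target energies and the pretrained model's own predictions. The sketch below illustrates that idea only. The function `reestimate_e0`, its arguments, and the toy numbers are assumptions for illustration, and the procedure in the MACE codebase may differ.

```python
import numpy as np


def reestimate_e0(counts, e_ref, e_model, e0_old):
    """Shift per-element reference energies so the pretrained model matches target data on average.

    counts  : (n_structures, n_elements) number of atoms of each element
    e_ref   : (n_structures,) target total energies (e.g. from DFT)
    e_model : (n_structures,) total energies predicted by the pretrained model
    e0_old  : (n_elements,) reference energies currently used by the model
    """
    residual = e_ref - e_model
    # Least-squares solution of counts @ shift = residual.
    shift, *_ = np.linalg.lstsq(counts, residual, rcond=None)
    return e0_old + shift


# Toy data (made up): two elements, four structures.
counts = np.array([[2.0, 1.0], [4.0, 2.0], [2.0, 2.0], [6.0, 1.0]])
e0_old = np.array([-1.0, -5.0])
true_shift = np.array([0.3, -0.7])        # per-atom offset between the two energy references
e_model = np.array([-7.5, -15.2, -12.9, -11.8])
e_ref = e_model + counts @ true_shift     # target energies differ only by that offset

e0_new = reestimate_e0(counts, e_ref, e_model, e0_old)
```

In this toy case the fit recovers the per-element offset exactly, so the model's baseline lines up with the target data before any weights are trained.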
Evaluation Metrics: Beyond standard pointwise energy and force errors, the study probes dynamic and extrapolative behaviors, including radial distribution functions (RDFs) from molecular dynamics (MD), Nudged Elastic Band (NEB) reaction profiles, MD stability tests, and Random Structure Search (RSS) to detect short-range repulsion failures.
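As an illustration of one of these metrics, the following NumPy sketch computes a radial distribution function for a single snapshot in a cubic periodic box. It assumes identical particles and `r_max` no larger than half the box length. The function name and the random test snapshot are made up for illustration, and the paper's own analysis tooling is not specified here.

```python
import numpy as np


def radial_distribution(positions, box_length, r_max, n_bins):
    """g(r) for one snapshot of identical particles in a cubic periodic box."""
    n = len(positions)
    # All pair displacement vectors, wrapped with the minimum-image convention.
    delta = positions[:, None, :] - positions[None, :, :]
    delta -= box_length * np.round(delta / box_length)
    dist = np.linalg.norm(delta, axis=-1)[np.triu_indices(n, k=1)]

    counts, edges = np.histogram(dist, bins=n_bins, range=(0.0, r_max))
    shell_volumes = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    n_pairs = n * (n - 1) / 2
    expected = n_pairs * shell_volumes / box_length ** 3   # ideal-gas pair counts per shell
    r = 0.5 * (edges[1:] + edges[:-1])
    return r, counts / expected


rng = np.random.default_rng(0)
box = 10.0
positions = rng.uniform(0.0, box, size=(500, 3))   # uncorrelated "ideal gas" snapshot
r, g = radial_distribution(positions, box, r_max=5.0, n_bins=25)
```

For uncorrelated positions, g(r) fluctuates around 1. In a real liquid, peaks mark solvation shells, and comparing those peaks between the fine-tuned model and a reference is what the RDF test probes.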

Key Results

Prerequisites Dominate Strategy Choice: The study finds that foundation model quality, correct $E_0$ initialisation, and well-chosen hyperparameters are prerequisites whose impact routinely exceeds that of the specific fine-tuning strategy.
- Foundation Quality: Newer foundation models (e.g., OMat24-based) consistently outperform older ones (MPTraj-based) in OOD transfer, even with fixed fine-tuning recipes.
- $E_0$ Initialisation: Using "averaged" $E_0$ values leads to significantly higher errors and MD instability (e.g., ice models failing within 50 ps). "Reestimated" $E_0$s (aligning the pretrained model's zero-point to the target data) are critical for stability and transferability, often affecting results more than the choice of fine-tuning algorithm itself.
- Hyperparameters: Naive fine-tuning requires reduced learning rates and increased EMA decay. LoRA tolerates higher learning rates. Multihead replay requires substantially lower learning rates to avoid competing update signals. Weight decay should be set to zero, because it shrinks parameters toward zero and therefore pulls them away from the pretrained solution.
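The weight-decay and EMA points above can be seen in a few lines of plain Python. This is a sketch with plain SGD and made-up numbers. The learning rate, decay values, and parameter vector are assumptions for illustration, not settings from the paper.

```python
def sgd_step(theta, grad, lr, weight_decay):
    """One SGD step with L2-style weight decay (the decay term pulls theta toward zero)."""
    return [t - lr * (g + weight_decay * t) for t, g in zip(theta, grad)]


def ema_update(theta_ema, theta, decay):
    """Exponential moving average of parameters; larger decay means slower, smoother tracking."""
    return [decay * a + (1.0 - decay) * t for a, t in zip(theta_ema, theta)]


pretrained = [2.0, -1.0]
zero_grad = [0.0, 0.0]   # pretend the target data already agree with the pretrained model

theta_wd, theta_no_wd = list(pretrained), list(pretrained)
for _ in range(100):
    theta_wd = sgd_step(theta_wd, zero_grad, lr=0.01, weight_decay=0.1)
    theta_no_wd = sgd_step(theta_no_wd, zero_grad, lr=0.01, weight_decay=0.0)

# Without weight decay the parameters stay at the pretrained values;
# with it they shrink toward zero even though the data asked for no change.
drift_wd = abs(theta_wd[0] - pretrained[0])
drift_no_wd = abs(theta_no_wd[0] - pretrained[0])

# EMA: after one noisy step, a higher decay keeps the average closer to its previous value.
noisy = [2.5, -1.5]
ema_low = ema_update(pretrained, noisy, decay=0.9)
ema_high = ema_update(pretrained, noisy, decay=0.999)
```

The same logic explains the advice for naive fine-tuning: a higher EMA decay damps the effect of individual noisy updates on the weights that are finally evaluated.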
Performance by Objective:
- In-Distribution Specialisation (Single System): For narrow tasks (e.g., SN2 barriers, aqueous NaCl solvation), most strategies (Naive, LoRA, Multihead) achieve strong accuracy, consistently surpassing models trained from scratch. Naive fine-tuning offers the best convergence for single-system applications.
- Out-of-Distribution Robustness: When evaluating transfer to related but unseen compositions (e.g., non-argyrodite electrolytes) or different chemistries (e.g., biomolecules), Multihead Replay (with either original or pseudolabelled data) is the only approach that consistently preserves OOD robustness. It maintains accuracy on the pretraining distribution while learning the target task, effectively preventing catastrophic forgetting.
- Freezing and LoRA: While effective for parameter efficiency, layer freezing and LoRA showed limitations in adapting to solvation features or maintaining broad chemical robustness compared to multihead replay in the tested scenarios.
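The multihead replay objective discussed above can be sketched as a shared backbone with two readout heads and a summed loss. The toy model below is a hypothetical stand-in (one `tanh` layer instead of a message-passing network, energies only). The `replay_weight` parameter and the choice to start the new head from the pretrained readout are assumptions for illustration.

```python
import numpy as np


def predict(x, w_shared, w_head):
    """Shared backbone (one linear layer + tanh) followed by a head-specific linear readout."""
    return np.tanh(x @ w_shared) @ w_head


def multihead_loss(params, target_batch, replay_batch, replay_weight=1.0):
    """Target data go through the fine-tuning head, replay data through the pretraining head."""
    x_t, y_t = target_batch
    x_r, y_r = replay_batch
    loss_target = np.mean((predict(x_t, params["shared"], params["head_ft"]) - y_t) ** 2)
    loss_replay = np.mean((predict(x_r, params["shared"], params["head_pt"]) - y_r) ** 2)
    return loss_target + replay_weight * loss_replay, loss_target, loss_replay


rng = np.random.default_rng(0)
params = {
    "shared": rng.normal(size=(5, 8)),   # backbone parameters used by both heads
    "head_pt": rng.normal(size=8),       # readout for the replay (pretraining-like) data
}
params["head_ft"] = params["head_pt"].copy()   # assumption: new head starts from the old readout

x_replay = rng.normal(size=(16, 5))
y_replay = predict(x_replay, params["shared"], params["head_pt"])        # labels the model already fits
x_target = rng.normal(size=(16, 5))
y_target = predict(x_target, params["shared"], params["head_ft"]) + 0.5  # shifted target task

total, loss_t, loss_r = multihead_loss(params, (x_target, y_target), (x_replay, y_replay))
```

At the start the replay term is zero and the target term is not. Any update to the shared backbone that lowers the target loss while raising the replay loss is penalised, which is the mechanism that discourages forgetting. Generating `y_replay` from the starting model, as done here, also mirrors the pseudolabelled variant.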

Significance and Claims
The paper claims that the perceived fragility of naive fine-tuning in MLIPs is largely a result of suboptimal setup rather than an intrinsic limitation of the method. The authors argue that:

Naive fine-tuning is a viable and often superior starting point for single-system applications, provided the foundation model is high-quality and $E_0$s are correctly reestimated.
Multihead replay is the necessary strategy for broader deployment where preserving the foundation model's behavior outside the fine-tuning distribution is required.
Pseudolabelled replay offers a practical advantage by allowing the use of any structurally diverse dataset for replay, removing the dependency on access to the original pretraining corpus.

The work establishes that for practitioners, investing in the strongest available foundation model and ensuring correct atomic reference energy alignment are more critical design choices than selecting a specific constrained fine-tuning algorithm. The study provides a systematic framework for deploying MLIP foundation models, moving fine-tuning from a niche option to a default starting point for system-specific development.

Fine-tuning MLIP foundation models: strategies for accuracy and transferability
