Extending OpenKIM with an Uncertainty Quantification… — Plain-Language Explanation

Original authors: Yonatan Kurniawan, Cody L. Petrie, Mark K. Transtrum, Ellad B. Tadmor, Ryan S. Elliott, Daniel S. Karls, Mingjian Wen

Published 2026-05-08

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a chef trying to recreate a famous dish. You have a recipe (the Interatomic Potential, or IP) that tells you how much salt, pepper, and heat to use. You taste the dish, adjust the spices, and taste again until it's perfect. This is how scientists build models to predict how materials behave at the atomic level.

However, there's a problem: No recipe is perfect. Even if you get the spices right, the recipe itself might be missing a secret ingredient (like a specific type of oil) that the original chef used. If you try to cook a different dish using this same recipe, it might taste terrible because the recipe wasn't designed for that.

This is the core problem this paper addresses: How do we know how much to trust our recipe when we use it for new situations?

Here is a breakdown of the paper's work using simple analogies:

1. The Problem: The "Sloppy" Recipe

In the world of atoms, scientists use mathematical formulas (IPs) to predict energy and forces. These formulas have "knobs" (parameters) that get turned until the formula matches reference data, such as measurements or more detailed calculations.

The Issue: Many of these formulas are "sloppy." This means that many different combinations of knob settings produce nearly the same result for the data you trained on. It's like having a recipe where you can double the salt and halve the pepper, and the dish still tastes the same to you, but it might fail completely if you try to bake a cake with it.
The Risk: Because the recipe is sloppy, we don't know which setting is the "true" one. When we use the recipe for new predictions, we might be wildly off, and we won't know it.
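
To make the salt-and-pepper picture concrete, here is a minimal sketch of an extreme, idealized case of sloppiness. The toy "recipe" and its two parameters are invented for illustration and are not from the paper: the training quantity depends only on the product of the two knobs, so two very different settings fit the data equally well yet disagree on a new quantity.

```python
def predict_training(salt, pepper, x):
    # Training-type quantity: depends only on the product salt * pepper.
    return salt * pepper * x


def predict_new(salt, pepper, x):
    # A different quantity that depends on each knob separately.
    return (salt + pepper) * x


def cost(salt, pepper, data):
    # Least-squares misfit between the toy model and the training data.
    return 0.5 * sum((predict_training(salt, pepper, x) - y) ** 2 for x, y in data)


# Hypothetical training data generated with salt * pepper = 6.
data = [(x, 6.0 * x) for x in (1.0, 2.0, 3.0)]

fit_a = (2.0, 3.0)   # one knob setting
fit_b = (4.0, 1.5)   # double the salt, halve the pepper

cost_a = cost(*fit_a, data)
cost_b = cost(*fit_b, data)
new_a = predict_new(*fit_a, 1.0)  # 5.0
new_b = predict_new(*fit_b, 1.0)  # 5.5

print(cost_a, cost_b)  # identical fit quality on the training data
print(new_a, new_b)    # different answers for the new quantity
```

Real interatomic potentials are rarely exactly degenerate like this; the point is only that equal fit quality on the training data does not pin down predictions elsewhere.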

2. The Solution: A "Confidence Meter" (Uncertainty Quantification)

The authors work within a project called OpenKIM (a giant library of these atomic recipes), which includes a fitting toolkit called KLIFF. Think of KLIFF as a smart kitchen assistant. The goal is an assistant that doesn't just cook the dish, but also tells you how confident you should be in the result.

They added a new feature to KLIFF that performs Uncertainty Quantification (UQ). Instead of just giving you one answer, it gives you a range of possibilities and tells you how "wobbly" the answer is.

3. How It Works: The "Parallel-Universes" Cooking Class

To figure out how wobbly the answer is, the toolkit uses a method called MCMC (Markov Chain Monte Carlo). Imagine a cooking class where:

The Chef: You have a main chef who finds the "best fit" recipe (the one that matches your training data most closely).
The Students: You send out 100 students (called "walkers") to try slightly different versions of the recipe.
The Temperature: Here is the clever part. The students are cooking at different "temperatures."
- Low Temperature: The students are very strict. They only try recipes that are very close to the best fit. They are safe, but they might miss big errors.
- High Temperature: The students are wild. They try crazy combinations of spices. This helps them find out if the recipe breaks down completely if you stray too far from the center.

By mixing the results from these different "temperatures," the toolkit can see how much the recipe changes when you tweak the knobs. If the recipe stays tasty even when the students go wild, the model is robust. If the dish turns into soup when you change the knobs slightly, the model is unreliable.
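
The cooking class can be sketched in a few lines of code. What follows is a toy parallel-tempering sampler, assuming a one-parameter quadratic cost and a plain Metropolis update with one "student" per temperature. It is not the sampler used by the toolkit, and every name in it is hypothetical. Each chain explores at its own temperature, and neighboring chains occasionally swap recipes.

```python
import math
import random
import statistics


def cost(theta):
    # Toy quadratic cost with its best fit at theta = 0 (an assumption).
    return 0.5 * theta ** 2


def accept(log_ratio, rng):
    # Metropolis rule: accept with probability min(1, exp(log_ratio)).
    return log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)


def parallel_tempering(temperatures, n_steps=20000, step=1.0, seed=0):
    rng = random.Random(seed)
    thetas = [0.0 for _ in temperatures]
    samples = [[] for _ in temperatures]
    for _ in range(n_steps):
        # Within-chain update: each chain targets exp(-cost / T).
        for k, temp in enumerate(temperatures):
            proposal = thetas[k] + rng.gauss(0.0, step)
            if accept(-(cost(proposal) - cost(thetas[k])) / temp, rng):
                thetas[k] = proposal
        # Swap move between a random pair of neighboring temperatures.
        k = rng.randrange(len(temperatures) - 1)
        log_swap = (1.0 / temperatures[k] - 1.0 / temperatures[k + 1]) * (
            cost(thetas[k]) - cost(thetas[k + 1])
        )
        if accept(log_swap, rng):
            thetas[k], thetas[k + 1] = thetas[k + 1], thetas[k]
        for k, theta in enumerate(thetas):
            samples[k].append(theta)
    return samples


cold, hot = parallel_tempering([1.0, 4.0])
spread_cold = statistics.pstdev(cold)
spread_hot = statistics.pstdev(hot)
print(spread_cold, spread_hot)  # the hot chain wanders farther from the best fit
```

For this quadratic cost the spread of each chain should grow roughly like the square root of its temperature, which is the "strict" versus "wild" students in numerical form.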

4. The "Evaporation" Surprise

The authors observed a striking phenomenon they call "Parameter Evaporation."

Imagine you are looking for a specific spot on a map (the best recipe). At low temperatures, everyone agrees on the spot.
As you turn up the "temperature" (making the rules looser to account for the fact that the recipe isn't perfect), the students start wandering off.
Suddenly, for some ingredients (parameters), the students stop wandering in a small circle and start spreading out to the very edges of the map. They "evaporate" from the center.
Why this matters: When this happens, the "best" recipe you found earlier might not even be represented in the group anymore. The model is telling you, "Hey, if we account for the fact that our recipe is imperfect, the 'perfect' setting you found earlier might actually be wrong."
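
A toy calculation can show why this happens. The sketch below is an illustration under invented assumptions, not the paper's silicon model: a single parameter whose cost has a narrow well at the best fit and flattens into a plateau elsewhere, with a uniform prior on a bounded "map." It measures how much of the probability stays near the best fit as the temperature rises.

```python
import math


def cost(theta, width=0.5):
    # Toy cost: a narrow well at theta = 0 that flattens into a plateau.
    return 1.0 - math.exp(-theta ** 2 / (2.0 * width ** 2))


def mass_near_best_fit(temperature, bound=10.0, radius=1.0, n_grid=4001):
    # Uniform prior on [-bound, bound]; posterior weight is exp(-cost / T).
    grid = [-bound + 2.0 * bound * i / (n_grid - 1) for i in range(n_grid)]
    weights = [math.exp(-cost(t) / temperature) for t in grid]
    near = sum(w for t, w in zip(grid, weights) if abs(t) <= radius)
    return near / sum(weights)


temperatures = (0.01, 0.1, 0.3, 1.0)
fractions = {T: mass_near_best_fit(T) for T in temperatures}
for T in temperatures:
    print(T, round(fractions[T], 3))
```

At low temperature almost all of the probability sits in the well. Once the temperature is comparable to the height of the plateau, the much larger flat region wins and the samples spread out to the edges of the prior, even though the best fit has not moved.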

5. The Takeaway for Scientists

The authors built this tool to help scientists:

Stop guessing: Instead of just saying "This model predicts X," they can say, "This model predicts X, give or take a stated margin, because the recipe is sloppy."
Avoid bad decisions: By seeing how the results change at different "temperatures," scientists can avoid trusting a model that looks good on paper but falls apart in reality.
Improve recipes: If the uncertainty is too high, the scientists know they need to gather more data or simplify the recipe (remove the "sloppy" parts) to make it more reliable.

In short: This paper introduces a new tool that acts like a "lie detector" for atomic models. It doesn't just tell you what the model predicts; it tells you how much you should trust that prediction by simulating thousands of slightly different versions of the model to see how stable the results really are.

Technical Summary: Extending OpenKIM with an Uncertainty Quantification Toolkit for Molecular Modeling

Problem Statement
Atomistic simulations are fundamental to materials science, relying heavily on Interatomic Potentials (IPs) to approximate interaction energies. The accuracy of these simulations is contingent upon the choice of IP and its parameters. While the Open Knowledgebase of Interatomic Models (OpenKIM) provides a standardized framework for IP implementation and evaluation, it lacks a unified tool for Uncertainty Quantification (UQ).

A primary challenge in molecular modeling UQ is "sloppiness," where models are ill-conditioned, and many parameter combinations are practically unidentifiable given available data. Furthermore, the dominant source of uncertainty is often not random data noise, but "model inadequacy"—the inability of the IP's functional form to capture all relevant physics. Existing UQ libraries (e.g., emcee, Chaospy) are not specifically integrated for molecular modeling workflows, and standard Bayesian methods often struggle to account for the systematic errors introduced by model inadequacy without specific adjustments.

Methodology
The authors introduce a UQ toolkit extension to KLIFF (KIM-based Learning-Integrated Fitting Framework), a Python package within the OpenKIM ecosystem. The methodology employs a Bayesian approach using Parallel-Tempered Markov Chain Monte Carlo (PTMCMC) to quantify two sources of uncertainty: parameter variations and functional form inadequacy.

Key methodological components include:

Cost Function and Weighting: The framework uses a weighted least-squares cost function. Because model inadequacy dominates over data noise, the authors broaden the likelihood by introducing a hyper-parameter, the temperature ($T$), which scales the weights.
Temperature Selection: Drawing an analogy between Bayesian statistics and statistical mechanics, the authors define a natural sampling temperature $T_0 = 2C_0/N$, where $C_0$ is the cost at the best fit and $N$ is the number of parameters. This $T_0$ serves as an estimate of the scale of model bias.
PTMCMC Implementation: The toolkit implements PTMCMC to sample multiple Markov chains at different temperatures simultaneously. Chains are mixed to improve convergence rates and allow walkers to explore the parameter space more effectively, particularly in the presence of "sloppy" modes.
Convergence Assessment: Convergence is monitored using the multivariate potential scale reduction factor ($\hat{R}_p$). The process terminates when $\hat{R}_p$ falls below a threshold (typically 1.05–1.1).
Software Integration: The toolkit is implemented as a module (kliff.uq) within KLIFF. It allows users to define custom priors (defaulting to uniform), specify temperature ladders, and handle parallelization via multiprocessing pools.
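
The cost function and temperature definitions above can be written out as a short sketch. The function names, the sample numbers, and the convention that each weight multiplies the squared residual are assumptions made for illustration; they are not taken from KLIFF's API, whose own weighting convention may differ.

```python
def weighted_cost(predictions, references, weights):
    # C = 1/2 * sum_i w_i * (prediction_i - reference_i)^2
    return 0.5 * sum(
        w * (p - r) ** 2 for p, r, w in zip(predictions, references, weights)
    )


def natural_temperature(best_fit_cost, n_params):
    # T_0 = 2 * C_0 / N, with N the number of parameters.
    return 2.0 * best_fit_cost / n_params


def tempered_log_likelihood(cost_value, temperature):
    # log L = -C / T, up to an additive constant.
    return -cost_value / temperature


# Hypothetical best-fit predictions for three reference values.
references = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 2.5]
weights = [4.0, 1.0, 4.0]

c0 = weighted_cost(predictions, references, weights)  # 1.0
t0 = natural_temperature(c0, n_params=4)              # 0.5
print(c0, t0, tempered_log_likelihood(c0, t0))        # 1.0 0.5 -2.0
```

Dividing the cost by $T$ is equivalent to dividing every weight by $T$, which is why a larger temperature widens the resulting error bars.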

Key Contributions

Integration: The paper presents the first UQ toolkit integrated directly into the OpenKIM framework, standardizing the reporting of uncertainty in molecular modeling workflows.
Handling Model Inadequacy: The implementation explicitly addresses model inadequacy by adjusting the sampling temperature ($T$) to inflate error bars, effectively treating the functional form error as a systematic bias.
Flexibility: The toolkit supports custom weighting schemes for individual data points (extending beyond single weights per property type) and allows for various prior distributions.
Demonstration: The authors demonstrate the framework using a Stillinger–Weber (SW) potential for silicon, training on energies and forces derived from an Environment-Dependent Interatomic Potential (EDIP).

Results
The application of the toolkit to the SW potential for silicon yielded several critical observations:

Parameter Evaporation: As the sampling temperature increases, the marginal posterior distributions of certain parameters (specifically $\lambda$ and $\gamma$) abruptly transition from being localized around the best-fit values to spreading out to the boundaries of the prior. This phenomenon, termed "parameter evaporation," indicates that at higher temperatures, the posterior is dominated by high-entropy regions of the parameter space rather than data-fitting regions.
Shift in Best-Fit Estimates: Even for parameters that remain localized (e.g., $A$ and $B$), their distributions shift at higher temperatures due to the evaporation of coupled parameters ($\lambda$ and $\gamma$). This suggests that the "best fit" parameters may not be well-represented in the ensemble at temperatures significantly higher than $T_0$.
Cost Distribution: As temperature increases, the entire distribution of costs shifts toward higher values rather than merely stretching, indicating that the posterior is sampling regions of parameter space that fit the data poorly but have high prior probability.
Convergence: The PTMCMC approach successfully converged with a maximum $\hat{R}_p$ of 1.046 after 150,000 iterations (with burn-in and thinning applied).
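
To give a feel for what a value like 1.046 means, here is a minimal sketch of the classic single-parameter Gelman-Rubin statistic. This is a simplification: the paper monitors a multivariate version, and the synthetic chains below are random draws invented for illustration rather than real sampler output.

```python
import math
import random
import statistics


def gelman_rubin(chains):
    # Univariate potential scale reduction factor for equal-length chains.
    n = len(chains[0])
    means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(0)

# Four chains that all sample the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]

# Four chains where two are stuck around a different value.
stuck = [
    [rng.gauss(offset, 1.0) for _ in range(2000)]
    for offset in (0.0, 0.0, 3.0, 3.0)
]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

When the chains agree, the pooled variance matches the within-chain variance and the ratio sits close to 1. Chains that disagree inflate the between-chain term and push the statistic well above the 1.05 to 1.1 range.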

Significance and Claims
The authors position this work as a step toward making atomistic simulations more reliable and reproducible by embedding UQ directly into the IP development and application workflow. They emphasize that while the toolkit lowers the barrier to entry for practitioners, UQ remains an emerging field with open questions, particularly regarding model inadequacy.

The paper modestly claims that the toolkit provides a framework for transparent and reproducible UQ analysis rather than a "black box" solution. The authors explicitly caution users against treating the methods as off-the-shelf tools without understanding the statistical subtleties of sloppy models. They recommend that practitioners:

Test the robustness of their conclusions across a range of sampling temperatures and prior choices.
Avoid Jeffreys priors in the presence of degenerate modes due to potential strong biases.
Focus UQ analysis on ensembles generated by temperatures near $T_0$ (specifically 50% below to 50% above), using higher temperatures primarily to aid convergence rather than for final uncertainty estimates.

The authors conclude that IP developers should utilize these tools throughout the model development cycle, potentially using them to identify sloppy parameters for model reduction or to guide the expansion of training data. Future work aims to integrate frequentist methods (profile likelihoods) and model reduction schemes based on information geometry.
