Ontology-based knowledge graph infrastructure for interoperable atomistic simulation data

This paper presents an ontology-based knowledge graph infrastructure that integrates heterogeneous atomistic simulation data and workflows into a standardized, machine-readable format to enhance data findability, interoperability, and reuse.

Original authors: Abril Azocar Guzman, Sarath Menon, Tilmann Hickel, Stefan Sandfeld

Published 2026-04-09

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a chef trying to recreate a famous dish from a recipe book. But here's the catch: every time you look at a different book, the ingredients are listed in different languages, the measurements are in random units (cups, grams, "a pinch"), and the instructions are written in code only the original chef understands. Worse yet, the books don't tell you why the chef chose those specific ingredients or what happened if they made a mistake.

This is exactly the problem scientists face with atomistic simulation data. These are computer experiments that model how atoms behave, which helps researchers design new materials. Currently, this data is scattered across thousands of files, stored in incompatible formats, and often missing the "story" of how it was made. It's like having a library where every book is written in a different dialect, and half the pages are torn out.

This paper presents a solution: A Universal Translator and a Master Librarian for Atomic Data.

Here is how they did it, explained simply:

1. The "Universal Dictionary" (The Ontologies)

First, the team built a massive, shared dictionary called an Ontology. Think of this as a strict set of rules for how to describe things.

  • Before: One scientist might call a material "Iron-Defect-Alpha," while another calls it "Fe-Vacancy-01." A computer has no idea these are the same thing.
  • After: The Ontology says, "No matter what you call it, if it's Iron with a missing atom, we will all call it Iron:Vacancy."
  • They created two main dictionaries:
    • CMSO (Computational Material Sample Ontology): Describes the "ingredients" (the materials, the atoms, the defects).
    • ASMO (Atomistic Simulation Methods Ontology): Describes the "cooking method" (the software used, the math formulas, the steps taken).
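
To make the "before and after" above concrete, here is a minimal sketch in plain Python. It is not how CMSO or ASMO are implemented: a real ontology defines classes and relations with unique identifiers rather than a lookup table. The alias table and the function name `canonical_term` are invented for illustration, and "Iron:Vacancy" is simply the example label used above.

```python
# Hypothetical alias table: each lab's private name points at one shared term.
# The entries mirror the example names in the text; none come from the paper.
ALIASES = {
    "iron-defect-alpha": "Iron:Vacancy",
    "fe-vacancy-01": "Iron:Vacancy",
}


def canonical_term(label):
    """Return the shared term for a local label, ignoring case and padding."""
    key = label.strip().lower()
    if key not in ALIASES:
        raise KeyError(f"no shared term registered for {label!r}")
    return ALIASES[key]


# Two different local names now resolve to the same shared term.
first = canonical_term("Iron-Defect-Alpha")
second = canonical_term("Fe-Vacancy-01")
print(first == second)
```

Once both names resolve to the same term, a computer can treat the two records as describing the same kind of thing.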

2. The "Smart Translator" (The Software Infrastructure)

Even with a dictionary, scientists don't want to stop using their familiar tools (like Excel or Python scripts) to write their notes. They don't want to learn a new, complex language just to be "correct."

So, the team built a middleware layer (called atomRDF).

  • Imagine a translator sitting between the scientist and the library.
  • The scientist writes their notes in their usual format (YAML, JSON, or even a simple text file).
  • The translator instantly converts those notes into the "Universal Dictionary" format and files them away in a giant, connected database called a Knowledge Graph.
  • The scientist doesn't have to change their workflow; the system does the heavy lifting in the background.
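
The translation step above can be pictured with a small sketch. This is not atomRDF's actual API. It only shows the general idea of turning a JSON note into (subject, relation, value) statements, the basic building block of a knowledge graph. The field names, the `ex:` prefix, the relation names, and "ExampleMD" are all assumptions made up for this example.

```python
import json

# A scientist's note in an everyday format (invented example data).
note = """
{"id": "sample-1", "element": "Fe", "defect": "vacancy", "software": "ExampleMD"}
"""

# Hypothetical mapping from the note's field names to shared vocabulary terms.
FIELD_TO_RELATION = {
    "element": "ex:hasElement",
    "defect": "ex:hasDefect",
    "software": "ex:wasCalculatedBy",
}


def to_triples(raw_json):
    """Convert one JSON note into a list of (subject, relation, value) triples."""
    doc = json.loads(raw_json)
    subject = "ex:" + doc["id"]
    triples = []
    for field, relation in FIELD_TO_RELATION.items():
        if field in doc:  # fields the note leaves out are simply skipped
            triples.append((subject, relation, doc[field]))
    return triples


triples = to_triples(note)
for triple in triples:
    print(triple)
```

The scientist keeps writing the JSON note; only the translator needs to know the shared vocabulary.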

3. The "Giant Connected Web" (The Knowledge Graph)

Instead of storing data in separate, isolated folders, they built a Knowledge Graph.

  • Old Way: Data is like a stack of index cards in different drawers. To find a connection, you have to manually pull out card A, then card B, and hope they match.
  • New Way: The Knowledge Graph is like a giant, glowing spiderweb. Every piece of data (an atom, a temperature, a software code) is a node. Every relationship (was calculated by, is made of, depends on) is a string connecting them.
  • Because everything is connected, you can ask the web complex questions like: "Show me all the energy calculations for Copper defects made using Method X, but only if the temperature was above 500 degrees." The web can answer that in a single query, even if the data came from five different research groups.
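
The example question above can be sketched with a toy graph in plain Python. A real RDF knowledge graph would typically be queried with SPARQL. The in-memory list and the helper functions `match`, `subjects_with`, and `lookup` below are stand-ins invented for illustration, as are all ids, relation names, and numbers.

```python
# Toy knowledge graph: every fact is a (subject, relation, value) triple.
TRIPLES = [
    ("calc-1", "element", "Cu"), ("calc-1", "method", "X"),
    ("calc-1", "temperature", 600), ("calc-1", "energy", -3.50),
    ("calc-1", "source", "group-A"),
    ("calc-2", "element", "Cu"), ("calc-2", "method", "X"),
    ("calc-2", "temperature", 300), ("calc-2", "energy", -3.61),
    ("calc-2", "source", "group-A"),
    ("calc-3", "element", "Cu"), ("calc-3", "method", "Y"),
    ("calc-3", "temperature", 700), ("calc-3", "energy", -3.47),
    ("calc-3", "source", "group-B"),
    ("calc-4", "element", "Al"), ("calc-4", "method", "X"),
    ("calc-4", "temperature", 800), ("calc-4", "energy", -3.10),
    ("calc-4", "source", "group-B"),
    ("calc-5", "element", "Cu"), ("calc-5", "method", "X"),
    ("calc-5", "temperature", 550), ("calc-5", "energy", -3.42),
    ("calc-5", "source", "group-C"),
]


def match(subject=None, relation=None, value=None):
    """Return triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, r, v) for (s, r, v) in TRIPLES
        if subject in (None, s) and relation in (None, r) and value in (None, v)
    ]


def subjects_with(relation, value):
    """All subjects that have this exact relation and value."""
    return {s for (s, _, _) in match(relation=relation, value=value)}


def lookup(subject, relation):
    """First value stored for this subject and relation."""
    return match(subject=subject, relation=relation)[0][2]


# "Copper, Method X, temperature above 500": intersect, then filter.
candidates = subjects_with("element", "Cu") & subjects_with("method", "X")
answers = sorted(
    (s, lookup(s, "energy"), lookup(s, "source"))
    for s in candidates
    if lookup(s, "temperature") > 500
)
print(answers)
```

The two surviving results come from different groups, which is the point: once everything uses the same relations, the origin of the data no longer gets in the way of the question.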

What Can You Do With This? (The Magic Tricks)

The paper shows three cool things this system can do:

  • The "Detective" (Cross-Dataset Analysis):
    They took data about "grain boundaries" (where crystals meet) from many different sources. Because the data was standardized, they could instantly see patterns that were invisible before. For example, they could see that certain types of boundaries are stable in Copper but not in Aluminum, simply by querying the web. It's like being able to compare every recipe for "Chocolate Cake" ever written to find the perfect one instantly.

  • The "Time Traveler" (Deriving New Science):
    Sometimes a dataset contains everything needed to compute a property that nobody actually computed. The team found existing data about how materials expand when heated. By connecting the volume and temperature records in the graph, they mathematically derived a new property (Thermal Expansion) that the original authors never explicitly published. They turned "dust" into "gold."

  • The "Replay Button" (Provenance & Reconstruction):
    This is perhaps the most powerful feature. In science, knowing how you got a result is as important as the result itself.

    • The system records the entire "cooking video" of the simulation.
    • If you find a result, you can press "Rewind" and see exactly which software, which version, and which settings were used.
    • Better yet, the system can try to rebuild the recipe automatically. It can generate a new script that says, "Here is the code to recreate this exact experiment." This goes a long way toward solving the "it worked on my computer" problem.
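
The "Time Traveler" and "Replay Button" ideas above can be sketched together in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the record layout, the field names (`temperature_K`, `volume_A3`, `run`), the provenance entries, the software name "ExampleMD", and every number are invented. The sketch fits a straight line to volume against temperature, reports a volumetric thermal expansion coefficient as the slope dV/dT divided by the volume at the lowest temperature, and then prints which run produced each input.

```python
# Invented records standing in for what a graph query might return.
records = [
    {"temperature_K": 300.0, "volume_A3": 1000.0, "run": "run-300"},
    {"temperature_K": 400.0, "volume_A3": 1005.0, "run": "run-400"},
    {"temperature_K": 500.0, "volume_A3": 1010.0, "run": "run-500"},
]

# Hypothetical provenance store: which software and settings produced each run.
provenance = {
    "run-300": {"software": "ExampleMD", "version": "1.0", "input": "T=300"},
    "run-400": {"software": "ExampleMD", "version": "1.0", "input": "T=400"},
    "run-500": {"software": "ExampleMD", "version": "1.0", "input": "T=500"},
}


def volumetric_expansion(rows):
    """Least-squares slope dV/dT divided by the volume at the lowest temperature."""
    rows = sorted(rows, key=lambda r: r["temperature_K"])
    temps = [r["temperature_K"] for r in rows]
    vols = [r["volume_A3"] for r in rows]
    t_mean = sum(temps) / len(temps)
    v_mean = sum(vols) / len(vols)
    covariance = sum((t - t_mean) * (v - v_mean) for t, v in zip(temps, vols))
    variance = sum((t - t_mean) ** 2 for t in temps)
    return (covariance / variance) / vols[0]


def replay_recipe(rows, prov):
    """One line per input run, describing what would need to be rerun."""
    lines = []
    for r in sorted(rows, key=lambda r: r["temperature_K"]):
        meta = prov[r["run"]]
        lines.append(
            f"{r['run']}: {meta['software']} {meta['version']} with {meta['input']}"
        )
    return lines


alpha_v = volumetric_expansion(records)
recipe = replay_recipe(records, provenance)
print(f"alpha_V = {alpha_v:.2e} per K")
for line in recipe:
    print(line)
```

Because every input keeps a link to the run that produced it, the derived number is never an orphan: anyone reading it can trace it back to the simulations behind it.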

Why Does This Matter?

Currently, a lot of scientific data is "orphaned": it exists, but it's too messy to use again. This infrastructure turns that messy pile of data into a FAIR resource:

  • Findable (You can search for it easily).
  • Accessible (You can retrieve it in a standard, well-documented way).
  • Interoperable (Data from different tools and research groups can be combined).
  • Reusable (You can use old data to do new science).

In short, this paper aims to build an operating system for the future of materials science. With computers doing the organizing, scientists can spend less time translating files and more time discovering new materials, from better batteries to stronger metals.
