From Code to Figure: A FAIR-Aligned Data Provenance Chain for Reproducible Simulation Research in Numerical Physics

This paper presents an integrated, FAIR-aligned workflow that combines version control, automated testing, structured logging, and standardized post-processing to establish a complete data provenance chain ensuring reproducibility from code development to published figures in numerical physics simulations.

Original authors: Markus Uehlein, Tobias Held, Christopher Seibel, Lukas G. Jonda, Baerbel Rethfeld, Sebastian T. Weber

Published 2026-04-30

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a chef who has spent years perfecting a complex recipe for a dish that changes slightly every time you cook it. One day, you publish a photo of the final dish in a cookbook. A year later, someone tries to recreate it, but they can't. Why? Because they don't know exactly which version of the recipe you used, what specific brand of ingredients you had in your pantry that day, or if you tweaked the oven temperature mid-cook.

This paper, written by Markus Uehlein and his team, is about solving that exact problem for scientists who run computer simulations instead of cooking meals. In the world of "numerical physics" (using computers to model how materials behave), the "recipes" are software codes that are constantly being updated, and the "dishes" are massive datasets.

Here is how the authors propose to keep everything traceable, using a simple, four-step workflow they call a Data Provenance Chain.

1. The Recipe Book (Version Control & Code Review)

In the past, if a scientist changed a line of code, they might just save it as simulation_final_v2_real_final.cpp. This is a recipe disaster waiting to happen.

The authors use a system called Git (think of it as a time-traveling recipe book). Every time someone changes the code, the change gets a unique identifier (a commit hash), a timestamp, and a "review" from a colleague before it's accepted. This ensures that if you look at a simulation from five years ago, you can see the exact version of the code used, down to the specific line of text. It's like having a photo of the chef's hands and the exact ingredients on the counter at the moment the dish was made.
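To make the idea concrete, here is a minimal sketch (in Python, not taken from the paper or from the monstr code) of how a simulation script could ask Git for the exact version of the code it is about to run. The function names are invented for illustration.

```python
# Hypothetical sketch: record the exact code version alongside a simulation run.
# Nothing here comes from the paper; it only illustrates how a commit hash
# ties a result back to the "recipe book".
import subprocess

def current_commit_hash() -> str:
    """Return the Git commit hash of the code that is about to run."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def working_tree_is_clean() -> bool:
    """True if there are no uncommitted changes that would make the run untraceable."""
    status = subprocess.check_output(["git", "status", "--porcelain"], text=True)
    return status.strip() == ""

if __name__ == "__main__":
    print("Running code at commit:", current_commit_hash())
    if not working_tree_is_clean():
        print("Warning: uncommitted changes; this run cannot be fully reproduced.")
```

A run stamped with this hash can later be checked out again exactly, which is the point of the "time-traveling recipe book".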

2. The Safety Checks (Automated Testing)

Before a simulation runs, the software performs automatic "safety checks."

  • Unit Checks: The code checks that physical units are consistent. For example, it won't let you add "meters" to "seconds" (you can't add distance to time!). If you try, the computer stops you before the simulation even starts.
  • Physics Checks: The code runs tiny test simulations to make sure the physics behaves the way it should (e.g., "If I heat this up, does the energy go up?"). If the answer is no, the system knows something is broken. A minimal sketch of both checks follows this list.
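Here is that sketch. The tiny Quantity class and the toy energy test are invented purely for illustration; the paper's actual checks live inside the simulation software itself.

```python
# Illustrative only, not the authors' code.
class Quantity:
    """A number with a physical unit; refuses to mix incompatible units."""
    def __init__(self, value: float, unit: str):
        self.value, self.unit = value, unit

    def __add__(self, other: "Quantity") -> "Quantity":
        if self.unit != other.unit:
            raise ValueError(f"cannot add {self.unit} to {other.unit}")
        return Quantity(self.value + other.value, self.unit)

# Unit check: adding meters to seconds fails before any simulation starts.
try:
    Quantity(3.0, "m") + Quantity(2.0, "s")
except ValueError as err:
    print("unit check caught:", err)

# Physics check: heating the system must raise its energy.
def total_energy(temperature: float) -> float:
    return 1.5 * temperature  # stand-in for a tiny test simulation

assert total_energy(300.0) < total_energy(600.0), "energy should grow with temperature"
```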

3. The "Black Box" Recorder (Structured Logging & Metadata)

When the simulation actually runs, it doesn't just spit out a list of numbers. It creates a hierarchical file (a fancy digital folder structure) that acts like a "black box" recorder on an airplane.

Inside this file, the scientists store:

  • The raw data (the results).
  • The exact input settings (the recipe).
  • The "build log" (what version of the code was used).
  • The environment (what kind of computer CPU was used).
  • A diary of the run (any warnings or errors that happened while it was cooking).

They use a standard format called HDF5/NeXus. Think of this as a universal container that keeps the data organized so that even if the original scientist forgets what they did, anyone else can open the box and understand exactly what happened.
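As an illustration only, the snippet below uses the common h5py library to write raw results and the "black box" of metadata into a single HDF5 file. The group layout and attribute names are invented for this sketch and do not follow the NeXus schema in detail.

```python
# A minimal sketch, assuming h5py and NumPy are available.
import platform
import numpy as np
import h5py

results = np.random.rand(100)          # stand-in for the raw simulation output

with h5py.File("run_0001.h5", "w") as f:
    f.create_dataset("results/raw_data", data=results)

    meta = f.create_group("metadata")
    meta.attrs["input_settings"] = "temperature=300K, timestep=1fs"     # the recipe
    meta.attrs["code_version"]   = "abc1234"                            # from the build log
    meta.attrs["cpu"]            = platform.processor()                 # the environment
    meta.attrs["run_diary"]      = "warning: step 42 reduced timestep"  # warnings/errors
```

Everything a later reader needs travels inside the same file as the numbers themselves.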

4. The Plating (From Data to Figures)

Finally, the scientists turn that raw data into the pretty graphs and pictures you see in a published paper. Usually, this step is messy—scientists might write a one-off script to make a graph and then delete it.

In this workflow, the step to make the picture is also version-controlled. The script used to make the graph is saved, and the graph itself is stamped with a link back to the raw data and the code used to make it.
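A hedged sketch of that idea, assuming the HDF5 file from the previous step exists: the plotting script reads the data, draws the figure, and stamps the saved PNG with the name of the data file and the code version. The file names are illustrative, not from the paper.

```python
import h5py
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

DATA_FILE = "run_0001.h5"   # produced by the logging step above

with h5py.File(DATA_FILE, "r") as f:
    data = f["results/raw_data"][:]
    code_version = f["metadata"].attrs["code_version"]

fig, ax = plt.subplots()
ax.plot(data)
ax.set_xlabel("step")
ax.set_ylabel("value")

# PNG text metadata carries the provenance link back to data and code.
fig.savefig("figure1.png",
            metadata={"Source data": DATA_FILE,
                      "Code version": str(code_version)})
```

Because the plotting script itself is also committed to Git, the figure, the data, and the code all point at each other.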

The Big Picture: The "Chain of Custody"

The main point of this paper is that these four steps shouldn't be separate islands. They need to be a chain.

  • Old Way: You publish a picture. Someone asks, "How did you get this?" You say, "I ran a simulation." They ask, "Which one?" You say, "I think it was the one from last Tuesday." Reproducibility fails.
  • New Way (The Paper's Method): You publish a picture. You click a link, and it shows you the exact code version, the exact input file, the computer it ran on, and the script used to make the picture. Reproducibility succeeds.
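To show what "clicking the link" could look like in practice, here is a small, purely illustrative sketch that starts from the published figure's metadata stamp (written in the plotting sketch above and readable with the Pillow library) and walks back to the data file, the input settings, and the machine the simulation ran on.

```python
from PIL import Image
import h5py

stamp = Image.open("figure1.png").text        # PNG text chunks written at plot time
print("Figure was made from:", stamp["Source data"])
print("Code version used:   ", stamp["Code version"])

with h5py.File(stamp["Source data"], "r") as f:
    meta = f["metadata"].attrs
    print("Input settings:", meta["input_settings"])
    print("CPU used:      ", meta["cpu"])
```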

The authors tested this on their own long-running simulation software (called monstr), which has been used for many studies over several years. They showed that by linking the code, the data, and the figures together, they created a system where anyone can trace a published result all the way back to the original software state, ensuring that scientific findings remain reliable and reusable for the long term.

In short: They built a system where every scientific result comes with its own "receipt" that proves exactly how it was made, preventing the "it works on my machine" problem from ruining scientific trust.
