The BOS-TMC Dataset: DFT Properties of 159k Experimentally Characterized Transition Metal Complexes Spanning Multiple Charge and Spin States

This paper introduces the BOS-TMC dataset, a comprehensive collection of over 2.9 million DFT properties for 159,000 experimentally characterized transition metal complexes across multiple charge and spin states, designed to serve as a high-fidelity foundation for machine learning, DFT benchmarking, and chemical exploration.

Aaron G. Garrison, Jacob W. Toney, Tatiana Nikolaeva, Roland G. St. Michel, Christopher J. Stein, Heather J. Kulik

Published 2026-04-10

Imagine you are trying to teach a robot how to understand the chemistry of the future. To do this, the robot needs a massive library of examples. It needs to see millions of different molecules, understand how they behave, and learn the rules that govern their energy and stability.

For a long time, this library was missing a crucial section: Transition Metal Complexes. These are molecules with a metal atom (like iron, copper, or gold) at the center, surrounded by other atoms. They are the "workhorses" of chemistry, used in everything from making medicines to creating solar panels. But they are notoriously difficult to study because they are messy, can exist in many different "moods" (spin states), and often carry electrical charges.

This paper introduces BOS-TMC, a massive new digital library designed to fix that problem. Here is the story of how the authors built it, explained simply.

1. The Problem: The "Messy Attic"

Think of the Cambridge Structural Database (CSD) as a giant, dusty attic filled with millions of blueprints for these metal molecules. These blueprints were drawn by real scientists using X-ray machines, so they are real, physical structures.

However, the attic is messy:

  • Missing Labels: Many blueprints don't say what the total electrical charge of the molecule is.
  • Multiple Personalities: A single molecule can change its "personality" (spin state). It can be calm (low-spin), energetic (intermediate-spin), or wild (high-spin). Old datasets mostly ignored these different personalities.
  • Broken Blueprints: Some blueprints were just sketches that didn't match the real chemistry.

2. The Solution: The "Digital Renovation Crew"

The authors (a team from MIT and the Technical University of Munich) acted like a high-tech renovation crew. They went into the attic and did three main things:

A. Cleaning and Labeling (Data Curation)
They wrote a smart computer program to go through 299,000 blueprints. They threw out the broken ones, fixed the missing hydrogen atoms (the tiny "glue" atoms), and used a clever math trick to figure out the electrical charge of every single molecule.

  • The Result: They ended up with 159,000 clean, verified, real-world metal complexes.

B. Exploring the "Moods" (Spin States)
Most old datasets only looked at the "calm" version of a molecule. But in reality, these molecules can get excited. The team calculated the properties of these molecules in up to three different moods (low, intermediate, and high spin).

  • The Result: Instead of 159,000 molecules, they now have 343,800 unique "molecule + mood" combinations. It's like having a photo of a person smiling, frowning, and laughing, rather than just one photo.
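To make the "moods" concrete, here is a minimal sketch of how one might list the candidate spin states for a metal center. It is not the authors' procedure, which this summary does not describe. It assumes the number of d electrons is already known from the metal and its oxidation state, and the function name `candidate_multiplicities` is made up for illustration. The rule is simple counting: five d orbitals can hold at most min(n, 10 - n) unpaired electrons, and the multiplicity is the number of unpaired electrons plus one.

```python
def candidate_multiplicities(d_electrons: int) -> list[int]:
    """Return up to three spin multiplicities (2S+1) for a d^n metal center.

    Assumption: all unpaired electrons sit in the five metal d orbitals.
    """
    if not 0 <= d_electrons <= 10:
        raise ValueError("d-electron count must be between 0 and 10")
    # Five d orbitals: at most min(n, 10 - n) electrons can be unpaired.
    max_unpaired = min(d_electrons, 10 - d_electrons)
    # The number of unpaired electrons has the same parity as n.
    min_unpaired = d_electrons % 2
    unpaired_counts = range(min_unpaired, max_unpaired + 1, 2)
    # Multiplicity 2S+1 equals the number of unpaired electrons plus one.
    return [u + 1 for u in unpaired_counts]


print(candidate_multiplicities(6))  # [1, 3, 5]  low, intermediate, high spin
print(candidate_multiplicities(9))  # [2]        only one mood available
```

This also shows why the summary says "up to three" moods: a d6 center such as iron(II) has three options, while a d9 center such as copper(II) has only one.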

C. The "Do Not Touch" Rule (Preserving Reality)
Here is the most important part. Usually, when scientists use computers to study molecules, they let the computer "relax" the structure, moving atoms around to find the perfect theoretical shape.

  • The Analogy: Imagine taking a photo of a crumpled piece of paper and then using a computer to smooth it out perfectly. You lose the crumple, which was part of the reality.
  • The BOS-TMC Approach: They said, "No!" They kept the heavy atoms exactly where the X-ray machine found them. They only moved the tiny, invisible hydrogen atoms to make the math work. This ensures the data reflects real, physical chemistry, not just a perfect computer fantasy.
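The "Do Not Touch" rule can be sketched in a few lines. This is only an illustration of the idea, not the software the authors used: the helper names `hydrogen_only_mask` and `apply_step` and the coordinates are hypothetical. The point is that any proposed movement is applied to hydrogen atoms only, while every heavier atom keeps its X-ray position.

```python
def hydrogen_only_mask(symbols):
    """True for atoms allowed to move (hydrogens), False for frozen heavy atoms."""
    return [s == "H" for s in symbols]


def apply_step(positions, step, movable):
    """Apply a displacement only to movable atoms; frozen atoms are returned unchanged."""
    return [
        tuple(p + d for p, d in zip(pos, disp)) if free else pos
        for pos, disp, free in zip(positions, step, movable)
    ]


# Placeholder geometry (not a real structure): one metal, one nitrogen, two hydrogens.
symbols = ["Fe", "N", "H", "H"]
positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.5, 1.0, 0.0), (2.5, -1.0, 0.0)]
step = [(0.5, 0.5, 0.5)] * len(symbols)  # a made-up optimizer step

movable = hydrogen_only_mask(symbols)
relaxed = apply_step(positions, step, movable)
print(relaxed[:2] == positions[:2])  # True: Fe and N did not move
```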

3. The Treasure Chest: What's Inside?

Once they had the structures, they ran them through a quantum-chemistry calculator (density functional theory, or DFT) to generate a massive list of properties. They didn't stop at one number per molecule; they generated more than 2.9 million data points.

Think of this as a "Molecular ID Card" for every single entry, containing:

  • Energy Levels: How hard is it to remove an electron from the molecule, or to add one? (HOMO/LUMO)
  • The Gap: How much energy separates the highest filled level (HOMO) from the lowest empty one (LUMO)?
  • The Charge: Where is the electrical charge concentrated?
  • The Magnetism: How does the molecule react to a magnetic field?
  • Atomization Energy: How much energy would it take to blow the whole molecule apart into individual atoms?
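Two entries on the ID card are simple arithmetic once the calculation is done. The sketch below shows the general textbook definitions, not the authors' code. It assumes a simplified picture in which electrons fill the levels from the bottom up with one list of orbital energies; open-shell spin states really have separate spin-up and spin-down levels. The function names and all numbers are placeholders, not values from the dataset.

```python
def homo_lumo_gap(orbital_energies, n_occupied):
    """Return (HOMO, LUMO, gap) from a list of orbital energies.

    n_occupied is the number of filled levels; it must leave at least one empty level.
    """
    energies = sorted(orbital_energies)
    homo = energies[n_occupied - 1]  # highest filled level
    lumo = energies[n_occupied]      # lowest empty level
    return homo, lumo, lumo - homo


def atomization_energy(molecule_energy, atom_energies, symbols):
    """Energy of the separated atoms minus the energy of the intact molecule."""
    return sum(atom_energies[s] for s in symbols) - molecule_energy


# Placeholder numbers chosen for easy arithmetic, not real DFT output.
print(homo_lumo_gap([-10.0, -6.0, -2.0, 1.0], n_occupied=2))  # (-6.0, -2.0, 4.0)
print(atomization_energy(-76.5, {"H": -0.5, "O": -75.0}, ["O", "H", "H"]))  # 0.5
```

A positive atomization energy means the molecule is lower in energy than its separated atoms, so energy must be supplied to pull it apart.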

4. The Stress Test: "Which Calculator is Best?"

The team knew that different computer calculators (called "functionals") give different answers. To test this, they took a smaller sample of 10,000 molecules and ran them through 12 different calculators.

  • The Finding: For some molecules, the calculators agreed closely. For others, especially Copper (Cu) and Nickel (Ni) complexes, they disagreed sharply.
  • Why it matters: This highlights exactly where our current scientific tools are weak. It tells future researchers, "Hey, if you are studying Copper, be careful which calculator you use!"
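One simple way to measure how much the calculators disagree is to take, for each molecule, the difference between the largest and smallest predicted value, and then average that spread per metal. The sketch below does exactly that. It is an assumed analysis for illustration, not necessarily the statistic used in the paper, and the record layout, the function name `spread_by_metal`, the functional labels, and the numbers are all placeholders.

```python
from collections import defaultdict


def spread_by_metal(records):
    """Average max-minus-min spread of a property across functionals, per metal.

    records: iterable of (metal, complex_id, functional, value) tuples.
    """
    # Collect every functional's prediction for each complex.
    per_complex = defaultdict(list)
    for metal, complex_id, _functional, value in records:
        per_complex[(metal, complex_id)].append(value)

    # Spread for one complex = largest prediction minus smallest prediction.
    per_metal = defaultdict(list)
    for (metal, _complex_id), values in per_complex.items():
        per_metal[metal].append(max(values) - min(values))

    # Average the per-complex spreads for each metal.
    return {metal: sum(s) / len(s) for metal, s in per_metal.items()}


# Placeholder records, not data from BOS-TMC.
records = [
    ("Fe", "a", "functional_A", 1.0),
    ("Fe", "a", "functional_B", 1.5),
    ("Fe", "a", "functional_C", 2.0),
    ("Cu", "b", "functional_A", 1.0),
    ("Cu", "b", "functional_B", 4.0),
    ("Cu", "c", "functional_A", 2.0),
    ("Cu", "c", "functional_B", 3.0),
]
print(spread_by_metal(records))  # {'Fe': 1.0, 'Cu': 2.0}
```

A metal with a large average spread is one where the choice of calculator matters most.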

5. Why Should You Care?

This dataset is a game-changer for two reasons:

  1. For AI and Machine Learning: If you want to train an AI to discover new drugs or better batteries, you need high-quality, diverse data. BOS-TMC provides the "diverse diet" of metal chemistry that AI models have been starving for. It covers charged molecules and different spin states that previous datasets ignored.
  2. For Scientists: It serves as a "gold standard" benchmark. If a new computer method claims to be better at predicting chemistry, scientists can test it against BOS-TMC to see if it actually works on real, messy, charged, multi-mood molecules.

The Bottom Line

The BOS-TMC dataset is like building a massive, high-definition map of a previously uncharted territory. It doesn't just show the mountains; it shows the valleys, the different weather patterns (spin states), and the electrical storms (charges). By keeping the structures true to their real-world X-ray photos, it gives scientists and AI a reliable foundation to build the next generation of chemical discoveries.
