Accurate Chemistry Collection: Coupled cluster… — Plain-Language Explanation

Original authors: Sebastian Ehlert, Jan Hermann, Thijs Vogels, Victor Garcia Satorras, Stephanie Lanius, Marwin Segler, Klaas J. H. Giesbertz, Derk P. Kooi, Kenji Takeda, Chin-Wei Huang, Giulia Luise, Rianne van den Be

Published 2026-02-17

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to build a perfect recipe book for cooking. To do this, you need to know the exact energy required to take a finished dish apart, atom by atom, back into its raw ingredients. In the world of chemistry, this "energy to break a molecule apart" is called the Total Atomization Energy (TAE).

For decades, scientists have struggled to create a massive, accurate recipe book that covers every type of dish (molecule) imaginable, not just the popular ones. Existing books were either too small, only covered simple dishes (like organic carbon-based molecules), or the measurements were just "good enough" but not precise enough for cutting-edge science.

This paper introduces a new, massive, and ultra-precise dataset called MSR-ACC/TAE25. Think of it as the "Encyclopedia Britannica" of molecular energy, but built with a level of precision that was previously impossible to achieve at this scale.

Here is a breakdown of what they did, using simple analogies:

1. The Goal: The "Gold Standard" Kitchen

The researchers wanted to create a dataset where the energy values are accurate to within 1 kilocalorie per mole (a tiny amount compared with the energy stored in a molecule's bonds). This is called "sub-chemical accuracy."

The Problem: Previous datasets were either like a small cookbook with only 100 recipes, or huge but with sloppy measurements.
The Solution: They created a library of 73,040 recipes (molecules), all measured with a "gold standard" ruler (a method called CCSD(T)/CBS). This is the most accurate ruler available for general chemistry.

2. The Ingredients: A Diverse Pantry

Most chemistry datasets focus on "organic" molecules (the kind found in living things, made mostly of Carbon, Hydrogen, Oxygen, and Nitrogen).

The Innovation: MSR-ACC/TAE25 is like a pantry stocked from the first three rows of the Periodic Table (hydrogen through chlorine, leaving out the noble gases). It includes metals like Lithium and Sodium, and elements like Silicon and Phosphorus.
The Constraint: They only included molecules that are stable and "closed-shell" (meaning their electrons are paired up nicely, like a happy couple). They excluded unstable, chaotic molecules that would break the measuring tools.

3. The Process: How They Built the Library

Building this library wasn't just about looking up numbers; they had to invent the molecules first. They used a three-step assembly line:

Step A: Drawing the Blueprints (Graph Generation)
Imagine a robot that draws every possible way to connect Lego bricks (atoms) together. They used three different strategies:
1. Brute Force: Trying every single combination for small molecules.
2. Sampling: Randomly picking combinations for larger molecules, ensuring they follow the rules of chemistry (valency).
3. AI Prediction: Using a smart AI (based on the GPT-2 architecture) to imagine new molecular shapes that humans hadn't thought of yet. About 20% of the molecules came from this AI's imagination!
Step B: Building the 3D Models
Once they had the blueprints, they had to turn them into 3D structures. They started with a rough sketch, then refined it with a fast computer program, and finally polished it with a super-precise program to ensure the atoms were in their most comfortable, stable positions.
Step C: The Quality Control (Filtering)
Not every blueprint makes a stable house. They ran a series of "stress tests":
- The "Spin" Test: They checked if the molecule would rather be in a "triplet" state (unstable) or a "singlet" state (stable). If it was unstable, they threw it out.
- The "Chaos" Test: Some molecules are so complex that standard math breaks down. They used a diagnostic tool (called %TAE[(T)]) to check for "multireference character" (chaos). If a molecule was too chaotic for their super-precise ruler, they excluded it. This ensured that every molecule in the final list could be measured with extreme confidence.

4. The Result: A Tool for the Future

The final dataset is a massive, open-source treasure chest.

Who is it for? It's for anyone building new ways to predict chemical behavior.
Why does it matter?
- Training AI: Just as you need many pictures to teach a computer to recognize cats, scientists need a large number of accurate energy values to train AI to predict how new drugs or materials will behave.
- Testing Theory: It acts as a "final exam" for new chemical theories. If a new computer program can't predict the energy of these 73,000 molecules correctly, the scientists know the program needs fixing.
- Beyond Organic Chemistry: Because it includes metals and other elements, it helps scientists design better batteries, solar cells, and industrial catalysts, not just new medicines.

The Bottom Line

Think of this paper as the release of the ultimate GPS map for the chemical world. Before, scientists had a map that was great for the city center (organic chemistry) but fuzzy and incomplete for the countryside (inorganic chemistry). Now, they have a high-definition, 3D map of the entire territory, allowing them to navigate the chemical space with unprecedented precision and speed.

This dataset is freely available to everyone, meaning the next breakthrough in clean energy or medicine might just be a few lines of code away, powered by this new, ultra-accurate data.

1. Problem Statement

Accurate thermochemical data, specifically within sub-chemical accuracy (defined as within 1 kcal mol⁻¹ of empirical ground truth), is critical for advancing computational chemistry and developing data-driven methods. However, existing datasets face significant limitations:

Size vs. Accuracy Trade-off: High-accuracy datasets (e.g., W4 series) are limited to small molecules due to the prohibitive cost of Full Configuration Interaction (FCI) or high-level composite methods.
Scope Limitations: Large datasets (e.g., GDB-9, G4(MP2)) often rely on lower-level approximations or empirical corrections that may fail for non-standard bonding or inorganic systems.
Chemical Diversity Gap: There is a lack of comprehensive datasets covering both organic and inorganic closed-shell, neutral molecules across the first three periods of the periodic table with rigorous ab initio accuracy.

This gap hinders the development and validation of machine learning (ML) models, density functional theory (DFT) functionals, and semi-empirical methods across a broad chemical space.

2. Methodology

The authors developed the Microsoft Research Accurate Chemistry Collection (MSR-ACC), with its first release, MSR-ACC/TAE25. The methodology involves a multi-stage pipeline for structure generation, filtering, and high-level labeling.

A. Structure Generation

The dataset targets closed-shell, charge-neutral, covalently bound equilibrium structures containing up to 5 non-hydrogen atoms drawn from elements H through Ar (excluding noble gases).

Graph Generation: Three distinct approaches were used to maximize diversity:
1. Brute-force enumeration: Exhaustive generation of graphs for up to 4 non-hydrogen atoms.
2. Degree sequence sampling: Sampling atoms and bond types (single, double, triple) respecting valency constraints, performed with both implicit and explicit hydrogen handling.
3. Generative AI: A GPT-2 transformer model trained on ~6M valid SMILES strings to generate ~1.5M novel graphs (85% novelty rate).
3D Optimization:
1. Initial 3D placement via UFF.
2. Conformational sampling and optimization via GFN2-xTB.
3. Refinement via r2SCAN-3c and final optimization via B3LYP-D3(BJ)/def2-TZVPP.
4. Duplicate Removal: Molecules were filtered based on reconstructed molecular graphs and Total Atomization Energy (TAE) at the respective theory levels.
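The brute-force enumeration step above can be pictured with a minimal sketch. It assumes a small table of standard valences, treats hydrogens implicitly, and does not remove isomorphic duplicates; the names `VALENCE` and `enumerate_graphs` are hypothetical and are not taken from the paper's pipeline.

```python
from itertools import combinations, product

# Assumed standard valences for a few elements (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}


def is_connected(n, bonds):
    """Return True if atoms 0..n-1 form a single connected component."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for (a, b), order in bonds.items():
            if order and i in (a, b):
                j = b if i == a else a
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
    return len(seen) == n


def enumerate_graphs(elements):
    """Yield (bonds, implicit_h) for each connected, valence-respecting graph.

    bonds maps an atom-index pair to a bond order (0 = no bond, 1..3).
    implicit_h lists how many hydrogens saturate each heavy atom.
    """
    n = len(elements)
    pairs = list(combinations(range(n), 2))
    for orders in product(range(4), repeat=len(pairs)):
        bonds = dict(zip(pairs, orders))
        used = [0] * n
        for (a, b), order in bonds.items():
            used[a] += order
            used[b] += order
        # Reject graphs that exceed any atom's valence.
        if any(u > VALENCE[e] for u, e in zip(used, elements)):
            continue
        if not is_connected(n, bonds):
            continue
        implicit_h = [VALENCE[e] - u for u, e in zip(used, elements)]
        yield bonds, implicit_h
```

For the heavy atoms `("C", "O")` this sketch yields a single-bonded and a double-bonded graph, which correspond to methanol and formaldehyde once the implicit hydrogens are added. The triple bond is rejected because it exceeds the assumed valence of oxygen.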

B. Filtering Criteria

To ensure the applicability of the CCSD(T) method, strict filters were applied:

Electronic Ground State: Singlet-triplet gaps ($S_0$-$T_1$) calculated at B3LYP/def2-TZVP must be positive. This removed ~5% of structures (those with triplet ground states).
Multireference Character: The diagnostic %TAE[(T)] (the fraction of TAE accounted for by connected triple excitations) was calculated at CCSD(T)/6-31G(d). Structures with %TAE[(T)] > 6% were discarded to avoid systems where CCSD(T) fails due to strong nondynamical correlation. This removed another ~5%.
Fragmentation: Molecules dissociating into covalently disconnected fragments were excluded.
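The three filters above can be expressed as a short sketch. It assumes the gap is stored as E(triplet) minus E(singlet), so that a positive value means a singlet ground state, and it computes %TAE[(T)] as the share of the CCSD(T) atomization energy contributed by the (T) term. The `Candidate` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    st_gap: float      # assumed convention: E(triplet) - E(singlet), kcal/mol
    tae_ccsd: float    # atomization energy at CCSD, kcal/mol
    tae_ccsd_t: float  # atomization energy at CCSD(T), kcal/mol
    n_fragments: int   # number of covalently connected fragments


def percent_tae_t(c: Candidate) -> float:
    """Percentage of the CCSD(T) atomization energy that comes from (T)."""
    return 100.0 * (c.tae_ccsd_t - c.tae_ccsd) / c.tae_ccsd_t


def passes_filters(c: Candidate, max_percent_t: float = 6.0) -> bool:
    """Keep singlet ground states with mild multireference character
    that stay in one covalently connected piece."""
    return (
        c.st_gap > 0.0
        and percent_tae_t(c) <= max_percent_t
        and c.n_fragments == 1
    )
```

The 6% default mirrors the cutoff quoted above; the energies passed in would come from the lower-level calculations used for screening, not from the final W1-F12 labels.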

C. High-Accuracy Labeling (W1-F12)

The final dataset was labeled using the W1-F12 composite wavefunction protocol to achieve CCSD(T)/CBS (Complete Basis Set) accuracy.

Protocol: Extrapolation of Hartree-Fock, CCSD, and (T) components to the CBS limit using explicitly correlated F12 methods.
Corrections: Includes core-valence (CV) corrections and specific basis set extrapolations (e.g., cc-pVDZ-F12, cc-pVTZ-F12).
Scope: All molecules with $\le 4$ non-hydrogen atoms were labeled. A representative subsample of molecules with 5 non-hydrogen atoms was also labeled to ensure diversity.
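To make the idea of extrapolating to the CBS limit concrete, here is a minimal sketch of a generic two-point inverse-power extrapolation, followed by a plain sum of components. It assumes the model E(L) = E_CBS + A / L^alpha, where L is the basis-set cardinal number. The exponent and the component names are placeholders, and the actual W1-F12 protocol may use different exponents and additional terms.

```python
def extrapolate_two_point(e_small, e_large, l_small, l_large, alpha):
    """Extrapolate to the complete-basis-set limit from two basis sets.

    Assumes E(L) = E_CBS + A / L**alpha and eliminates A using the two
    energies e_small (cardinal number l_small) and e_large (l_large).
    """
    ratio = (l_large / l_small) ** alpha
    return e_large + (e_large - e_small) / (ratio - 1.0)


def composite_tae(components):
    """Sum the energy components of a composite scheme (assumed additive)."""
    return sum(components.values())


# Synthetic check: energies generated from E_CBS = 100, A = 8, alpha = 3.
e_dz = 100.0 + 8.0 / 2**3
e_tz = 100.0 + 8.0 / 3**3
e_cbs = extrapolate_two_point(e_dz, e_tz, l_small=2, l_large=3, alpha=3.0)
```

Because the synthetic energies follow the assumed model exactly, `e_cbs` recovers the limit of 100 up to floating-point rounding. Real data only follow the model approximately, which is why the choice of exponent matters.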

D. Subsampling Strategy

From ~1M generated structures, a subset was selected for expensive W1-F12 labeling. The subsampling was optimized along three axes to maximize chemical diversity while minimizing computational cost:

Number of non-hydrogen atoms.
Presence of s-block elements.
Presence of 3rd-period elements.
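One simple way to subsample along these three axes is equal-quota stratified sampling, sketched below. This is a stand-in, not the authors' optimization procedure. The element sets, the choice not to count hydrogen as an s-block element, and the function names are assumptions.

```python
import random
from collections import defaultdict

S_BLOCK = {"Li", "Be", "Na", "Mg"}  # assumption: hydrogen is not counted
THIRD_PERIOD = {"Na", "Mg", "Al", "Si", "P", "S", "Cl"}


def stratum(elements):
    """Map a molecule's element list to its position on the three axes."""
    heavy = [e for e in elements if e != "H"]
    return (
        len(heavy),
        any(e in S_BLOCK for e in heavy),
        any(e in THIRD_PERIOD for e in heavy),
    )


def stratified_subsample(molecules, quota, seed=0):
    """Pick at most `quota` molecules from each stratum, reproducibly."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for name, elements in molecules.items():
        buckets[stratum(elements)].append(name)
    chosen = []
    for key in sorted(buckets):
        names = sorted(buckets[key])
        chosen.extend(rng.sample(names, min(quota, len(names))))
    return chosen
```

A fixed quota per stratum over-represents rare combinations, such as molecules containing both s-block and 3rd-period elements, relative to their share of the generated pool.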

3. Key Contributions

Dataset Scale and Scope: The release of MSR-ACC/TAE25, containing 73,040 total atomization energies. This is the largest dataset of its kind with sub-chemical accuracy, covering a vast chemical space including organic, inorganic, and mixed s/p-block systems.
Rigorous Accuracy: All data points are calculated at the CCSD(T)/CBS level using the W1-F12 protocol, ensuring "sub-chemical" accuracy (targeting <1 kcal mol⁻¹ error).
Open Access and Format: The dataset is available on Zenodo under the CDLA Permissive 2.0 license in the QCSchema format, facilitating immediate integration into ML and computational workflows.
Canonical Splits: Pre-defined 99% training and 1% validation splits are provided, with overlap removed against standard benchmarks (W4-17, GMTKN55) to prevent data leakage.
Auxiliary Data: Includes DFT atomization energies, singlet-triplet gaps, and W1-F12 energy components (HF, $\Delta$CCSD, $\Delta$(T), $\Delta$CV) for technical validation.
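The canonical splits described above follow a common pattern: drop anything that overlaps with held-out benchmarks, then split what remains. A minimal sketch of that pattern is shown here. It assumes molecules are compared by some canonical string key; how the authors actually detect overlap is not stated, and `canonical_split` is a hypothetical helper.

```python
import random


def canonical_split(keys, benchmark_keys, val_fraction=0.01, seed=0):
    """Remove benchmark overlap, then split into (train, validation).

    keys and benchmark_keys are iterables of hashable molecule
    identifiers (assumed to be canonical, so equal molecules compare equal).
    """
    kept = sorted(set(keys) - set(benchmark_keys))
    rng = random.Random(seed)
    rng.shuffle(kept)
    n_val = max(1, round(len(kept) * val_fraction))
    return kept[n_val:], kept[:n_val]
```

Sorting before shuffling with a fixed seed keeps the split reproducible regardless of the order in which the identifiers arrive.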

4. Results and Technical Validation

Chemical Diversity:
- Composition: 45.1% organic (containing C) and 54.9% inorganic.
- Elements: Covers H, Li, Be, B, C, N, O, F, Na, Mg, Al, Si, P, S, Cl.
- Bonding: Contains 287,000 non-hydrogen bonds, including non-traditional bonding situations (e.g., Li-Na, Be-Mg) not found in drug-like datasets like GDB-9.
- Structures: Includes linear (0.6%), planar (15.2%), and general 3D (84.3%) geometries.
Filtering Efficacy:
- The %TAE[(T)] diagnostic effectively removed multireference systems. The distribution peaks at ~2% and cuts off sharply at 6%.
- The singlet-triplet gap filter ensured all retained molecules are in their electronic ground state.
- Validation against the W4-17 dataset showed that the generation pipeline successfully recovered nearly all known stable structures (missing only fundamental exceptions like diborane due to graph constraints or unstable isomers).
Benchmarking:
- The dataset was used to evaluate various DFT functionals (e.g., B3LYP, $\omega$B97X-V, M06-2X).
- Error distributions were found to follow normal distributions, allowing for robust identification of outliers and systematic errors in approximate methods.
- Comparison against W4 reference values showed that W1-F12 achieves a Mean Absolute Deviation (MAD) of ~0.51 kcal mol⁻¹, supporting its reliability.
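The benchmarking statistics above can be reproduced in a few lines. The sketch below computes signed errors, the mean absolute deviation, and a simple n-sigma outlier flag that relies on the errors being roughly normally distributed. The function names and the 3-sigma default are assumptions, not the authors' exact criteria.

```python
import statistics


def error_stats(predicted, reference):
    """Return (errors, mad, mean, std) for predicted vs. reference values."""
    errors = [p - r for p, r in zip(predicted, reference)]
    mad = statistics.fmean([abs(e) for e in errors])  # mean absolute deviation
    mean = statistics.fmean(errors)                   # systematic (signed) error
    std = statistics.stdev(errors)                    # sample standard deviation
    return errors, mad, mean, std


def outliers(errors, mean, std, n_sigma=3.0):
    """Indices whose error lies more than n_sigma deviations from the mean."""
    return [i for i, e in enumerate(errors) if abs(e - mean) > n_sigma * std]
```

A nonzero mean error points to a systematic bias in the approximate method, while the flagged indices single out individual molecules worth inspecting.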

5. Significance

Advancing Data-Driven Chemistry: MSR-ACC/TAE25 provides the necessary "ground truth" to train deep learning models (e.g., Graph Neural Networks) and ML potentials that can generalize beyond typical organic chemistry to inorganic and mixed systems.
Method Development: It serves as a rigorous benchmark for developing new exchange-correlation functionals and semi-empirical methods, particularly for challenging s- and p-block compounds where current methods often struggle.
Standardization: By providing a large, diverse, and high-accuracy dataset in a standardized format, it enables the community to systematically identify and correct errors in electronic structure theories across a broad chemical space.
Future Outlook: The authors envision expanding MSR-ACC to include larger molecules and higher accuracy levels, continuing to push the boundaries of computational thermochemistry.

In summary, MSR-ACC/TAE25 bridges the gap between the high accuracy of small-molecule benchmarks and the broad chemical diversity required for next-generation computational chemistry tools.

Accurate Chemistry Collection: Coupled cluster atomization energies for broad chemical space