The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models

Meta FAIR introduces Open Molecules 2025 (OMol25), a large-scale dataset comprising over 100 million high-accuracy DFT calculations across 83 elements and diverse chemical systems, accompanied by baseline models and evaluations to advance machine learning for molecular simulations.

Daniel S. Levine, Muhammed Shuaibi, Evan Walter Clark Spotte-Smith, Michael G. Taylor, Muhammad R. Hasyim, Kyle Michel, Ilyes Batatia, Gábor Csányi, Misko Dzamba, Peter Eastman, Nathan C. Frey, Xiang Fu, Vahe Gharakhanyan, Aditi S. Krishnapriyan, Joshua A. Rackers, Sanjeev Raja, Ammar Rizvi, Andrew S. Rosen, Zachary Ulissi, Santiago Vargas, C. Lawrence Zitnick, Samuel M. Blau, Brandon M. Wood

Published 2026-03-05

Imagine you are trying to teach a computer to be a master chemist. You want it to predict how molecules behave, how drug molecules will bind inside the human body, or how new batteries will store energy.

For decades, the only way to do this accurately was to use Density Functional Theory (DFT). Think of DFT as a super-precise, super-slow physics engine. It calculates the behavior of every single electron in a molecule. It's like trying to simulate a hurricane by tracking the path of every single raindrop. It's incredibly accurate, but it takes so much computing power that you can only simulate tiny things for a split second.
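To get a feel for what one of these "raindrop-level" calculations involves, here is a minimal sketch using the open-source PySCF package at the level of theory the OMol25 paper reports (ωB97M-V/def2-TZVPD). This is a toy water example for illustration only, not the authors' production pipeline (which used a different DFT code):

```python
# Illustrative single-point DFT calculation with PySCF, mirroring
# OMol25's reported level of theory (wB97M-V/def2-TZVPD). This is a
# sketch, not the authors' production setup.
from pyscf import gto, dft

# A water molecule (coordinates in Angstrom).
mol = gto.M(
    atom="O 0.000 0.000 0.000; H 0.000 0.757 0.587; H 0.000 -0.757 0.587",
    basis="def2-tzvpd",
    charge=0,
    spin=0,  # number of unpaired electrons
)

mf = dft.RKS(mol)
mf.xc = "wb97m_v"     # range-separated meta-GGA functional...
mf.nlc = "vv10"       # ...with VV10 non-local dispersion
energy = mf.kernel()  # total electronic energy in Hartree
print(f"E(wB97M-V/def2-TZVPD) = {energy:.6f} Ha")
```

Even this three-atom example takes noticeably longer than an ML model's instant guess, and the cost grows steeply (roughly cubically or worse) with molecule size. That scaling wall is exactly what OMol25 is built to get around.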

Machine Learning (ML) offers a shortcut. If you show a computer enough examples of how molecules behave, it can learn the patterns and predict the answer instantly, like a seasoned chef guessing a recipe's taste without measuring every spice. But here's the problem: The computer needs a massive library of recipes to learn from.

Until now, that library was too small, too simple, or too messy. It was like trying to teach a chef to cook a global banquet using only a cookbook with 1,000 recipes for plain toast.

Enter Open Molecules 2025 (OMol25).

The "Encyclopedia of Everything"

The researchers at Meta FAIR and their partners built the OMol25 dataset. Think of this as the "Library of Alexandria" for molecules.

  • The Scale: They didn't just add a few more recipes; they generated more than 100 million high-precision calculations. That's like filling a library with billions of pages of chemistry. (A sketch of what one of these records looks like follows this list.)
  • The Diversity: Previous datasets were like a library that only had books about apples. OMol25 has books on apples, elephants, spaceships, and ocean currents. It covers:
    • 83 different elements (almost the whole periodic table).
    • Biomolecules: How proteins and DNA interact (crucial for drug discovery).
    • Metal Complexes: The weird, flexible structures used in catalysts and batteries.
    • Electrolytes: The soupy liquids inside batteries that make them work.
    • Reactivity: Molecules in the middle of breaking apart or joining together (like a car crash in slow motion).
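Each of those records boils down to the same ingredients: a 3D structure plus its DFT labels, such as total energy, per-atom forces, overall charge, and spin. Here is a hedged sketch of browsing such records with the ASE library; the file name and the `charge`/`spin` metadata keys are illustrative assumptions for this sketch, not OMol25's documented schema:

```python
# Illustrative only: browse molecular records with ASE. The file name
# and the info keys below are assumptions for this sketch, not
# OMol25's documented schema.
from ase.io import iread

for atoms in iread("omol25_sample.extxyz"):  # hypothetical sample file
    elements = sorted(set(atoms.get_chemical_symbols()))
    print(
        f"{len(atoms):4d} atoms | elements={elements} | "
        f"charge={atoms.info.get('charge', 0)} | "  # assumed key
        f"spin={atoms.info.get('spin', 1)}"         # assumed key
    )
    # Energy is read back if it was stored with the structure.
    print("  DFT energy (eV):", atoms.get_potential_energy())
```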

How They Built It: The "Virtual Lab"

You can't just go to a lab and run 100 million experiments; it would take a million years and cost more than the GDP of a small country.

Instead, they built a virtual lab, running the simulations on Meta's private supercomputing cloud.

  • The Analogy: Imagine a factory that builds toy cars. Usually, they build one car, test it, and move on. With OMol25, they built a factory that builds 100 million cars in different colors, sizes, and conditions (some in the rain, some on fire, some upside down) all at once.
  • The Cost: This required 6.6 billion CPU hours. That's like running a single computer non-stop for roughly 750,000 years (see the quick check below)! They did this by using "idle" computers that were sitting around at Meta, turning wasted electricity into scientific gold.
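That year figure is easy to sanity-check with two lines of arithmetic:

```python
# Back-of-the-envelope check of the compute claim.
cpu_hours = 6.6e9                  # total CPU hours reported
hours_per_year = 24 * 365          # one machine running non-stop
print(cpu_hours / hours_per_year)  # ~753,000 years on a single machine
```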

The "Test Drive" (Evaluations)

Just having the data isn't enough; you need to know if the AI actually learned anything. The paper introduces a series of challenge courses to test the AI models:

  1. The "Lock and Key" Test: Can the AI predict how well a drug molecule (the key) fits into a protein (the lock)?
  2. The "Stretch and Snap" Test: Can it predict how much energy is needed to bend a molecule before it breaks?
  3. The "Charge" Test: Can it handle molecules that have gained or lost electrons (like a battery charging)?
  4. The "Spin" Test: Can it predict what happens when the tiny magnetic spins of electrons change?

The Results: A New Era

They trained several AI models on this massive dataset and ran them through the challenge courses.

  • The Winners: Models like UMA and GemNet-OC performed incredibly well. In many areas, they reached "chemical accuracy" (errors within roughly 1 kcal/mol of the reference, meaning they are almost as good as the slow, expensive physics engine, but millions of times faster). A usage sketch follows this list.
  • The Gap: While they are great at predicting stable molecules, they still struggle a bit with the most chaotic scenarios, like complex chemical reactions or long-range forces in batteries. This tells scientists exactly where to focus their next round of improvements.
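These trained models plug into standard simulation toolkits. Below is a hedged sketch of calling one through ASE; the fairchem entry points shown (`pretrained_mlip.get_predict_unit`, the "uma-s-1" model name, the "omol" task label) follow that package's published examples as best I can reconstruct them, so treat them as assumptions to verify against the current fairchem documentation:

```python
# Hedged sketch: using an OMol25-era UMA model as an ASE calculator.
# The fairchem entry points and model/task names below are assumptions
# based on the package's published examples; verify against its docs.
from ase.build import molecule
from fairchem.core import FAIRChemCalculator, pretrained_mlip

predictor = pretrained_mlip.get_predict_unit("uma-s-1", device="cpu")
calc = FAIRChemCalculator(predictor, task_name="omol")  # molecular task

atoms = molecule("H2O")
atoms.calc = calc
print("energy (eV):", atoms.get_potential_energy())
print("forces (eV/Å):", atoms.get_forces())
```

The payoff of the whole dataset shows up right here: the same two calls that would take a DFT code minutes to hours come back in roughly milliseconds.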

Why This Matters to You

This isn't just about fancy math. This dataset is the foundation for the next generation of technology:

  • Medicine: Designing new drugs with fewer side effects, by simulating how they interact with the body before ever testing on a human.
  • Energy: Creating better, safer, and longer-lasting batteries for your phone and electric car.
  • Materials: Discovering new materials that are stronger, lighter, or more conductive.

In short: The authors didn't just build a bigger dataset; they built a universal training ground. They gave the AI a "PhD" in chemistry by feeding it a diet of more than 100 million high-quality examples. Now, the rest of the world can use this data to build AI that helps us solve some of humanity's biggest problems, from curing cancer to saving the climate.