The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models

Meta FAIR introduces Open Molecules 2025 (OMol25), a large-scale dataset comprising over 100 million high-accuracy DFT calculations across 83 elements and diverse chemical systems, accompanied by baseline models and evaluations to advance machine learning for molecular simulations.

Daniel S. Levine, Muhammed Shuaibi, Evan Walter Clark Spotte-Smith, Michael G. Taylor, Muhammad R. Hasyim, Kyle Michel, Ilyes Batatia, Gábor Csányi, Misko Dzamba, Peter Eastman, Nathan C. Frey, Xiang Fu, Vahe Gharakhanyan, Aditi S. Krishnapriyan, Joshua A. Rackers, Sanjeev Raja, Ammar Rizvi, Andrew S. Rosen, Zachary Ulissi, Santiago Vargas, C. Lawrence Zitnick, Samuel M. Blau, Brandon M. Wood

Published 2026-03-05

Imagine you are trying to teach a computer to be a master chemist. You want it to predict how molecules behave, how drug molecules will bind inside the human body, or how new batteries will store energy.

For decades, the only way to do this accurately was to use Density Functional Theory (DFT). Think of DFT as a super-precise, super-slow physics engine. It calculates the behavior of every single electron in a molecule. It's like trying to simulate a hurricane by tracking the path of every single raindrop. It's incredibly accurate, but it takes so much computing power that you can only simulate tiny things for a split second.
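To get a feel for what one of these "raindrop-level" calculations involves, here is a minimal sketch using the open-source PySCF package at the level of theory the OMol25 paper reports (ωB97M-V/def2-TZVPD). This is a toy water example for illustration only, not the authors' production pipeline (which used a different DFT code):

```python
# Illustrative single-point DFT calculation with PySCF, mirroring
# OMol25's reported level of theory (wB97M-V/def2-TZVPD). This is a
# sketch, not the authors' production setup.
from pyscf import gto, dft

# A water molecule (coordinates in Angstrom).
mol = gto.M(
    atom="O 0.000 0.000 0.000; H 0.000 0.757 0.587; H 0.000 -0.757 0.587",
    basis="def2-tzvpd",
    charge=0,
    spin=0,  # number of unpaired electrons
)

mf = dft.RKS(mol)
mf.xc = "wb97m_v"     # range-separated meta-GGA functional...
mf.nlc = "vv10"       # ...with VV10 non-local dispersion
energy = mf.kernel()  # total electronic energy in Hartree
print(f"E(wB97M-V/def2-TZVPD) = {energy:.6f} Ha")
```

Even this three-atom example takes noticeably longer than an ML model's instant guess, and the cost grows steeply (roughly cubically or worse) with molecule size. That scaling wall is exactly what OMol25 is built to get around.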

Machine Learning (ML) offers a shortcut. If you show a computer enough examples of how molecules behave, it can learn the patterns and predict the answer instantly, like a seasoned chef guessing a recipe's taste without measuring every spice. But here's the problem: The computer needs a massive library of recipes to learn from.

Until now, that library was too small, too simple, or too messy. It was like trying to teach a chef to cook a global banquet using only a cookbook with 1,000 recipes for plain toast.

Enter Open Molecules 2025 (OMol25).

The "Encyclopedia of Everything"

The researchers at Meta FAIR and their partners built the OMol25 dataset. Think of this as the "Library of Alexandria" for molecules.

  • The Scale: They didn't just add a few more recipes; they generated more than 100 million high-precision calculations. That's like filling a library with billions of pages of chemistry. (A sketch of what one of these records looks like follows this list.)
  • The Diversity: Previous datasets were like a library that only had books about apples. OMol25 has books on apples, elephants, spaceships, and ocean currents. It covers:
    • 83 different elements (almost the whole periodic table).
    • Biomolecules: How proteins and DNA interact (crucial for drug discovery).
    • Metal Complexes: The weird, flexible structures used in catalysts and batteries.
    • Electrolytes: The soupy liquids inside batteries that make them work.
    • Reactivity: Molecules in the middle of breaking apart or joining together (like a car crash in slow motion).
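Each of those records boils down to the same ingredients: a 3D structure plus its DFT labels, such as total energy, per-atom forces, overall charge, and spin. Here is a hedged sketch of browsing such records with the ASE library; the file name and the `charge`/`spin` metadata keys are illustrative assumptions for this sketch, not OMol25's documented schema:

```python
# Illustrative only: browse molecular records with ASE. The file name
# and the info keys below are assumptions for this sketch, not
# OMol25's documented schema.
from ase.io import iread

for atoms in iread("omol25_sample.extxyz"):  # hypothetical sample file
    elements = sorted(set(atoms.get_chemical_symbols()))
    print(
        f"{len(atoms):4d} atoms | elements={elements} | "
        f"charge={atoms.info.get('charge', 0)} | "  # assumed key
        f"spin={atoms.info.get('spin', 1)}"         # assumed key
    )
    # Energy is read back if it was stored with the structure.
    print("  DFT energy (eV):", atoms.get_potential_energy())
```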

How They Built It: The "Virtual Lab"

You can't just go to a lab and run 100 million experiments; it would take a million years and cost more than the GDP of a small country.

Instead, they built a virtual lab, running the simulations on Meta's private supercomputing cloud.

  • The Analogy: Imagine a factory that builds toy cars. Usually, they build one car, test it, and move on. With OMol25, they built a factory that builds 100 million cars in different colors, sizes, and conditions (some in the rain, some on fire, some upside down) all at once.
  • The Cost: This required 6.6 billion CPU hours. That's like running a single computer non-stop for roughly 750,000 years (see the quick check below)! They did this by using "idle" computers that were sitting around at Meta, turning wasted electricity into scientific gold.
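That year figure is easy to sanity-check with two lines of arithmetic:

```python
# Back-of-the-envelope check of the compute claim.
cpu_hours = 6.6e9                  # total CPU hours reported
hours_per_year = 24 * 365          # one machine running non-stop
print(cpu_hours / hours_per_year)  # ~753,000 years on a single machine
```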

The "Test Drive" (Evaluations)

Just having the data isn't enough; you need to know if the AI actually learned anything. The paper introduces a series of challenge courses to test the AI models:

  1. The "Lock and Key" Test: Can the AI predict how well a drug molecule (the key) fits into a protein (the lock)?
  2. The "Stretch and Snap" Test: Can it predict how much energy is needed to bend a molecule before it breaks?
  3. The "Charge" Test: Can it handle molecules that have gained or lost electrons (like a battery charging)?
  4. The "Spin" Test: Can it predict what happens when the tiny magnetic spins of electrons change?

The Results: A New Era

They trained several AI models on this massive dataset and ran them through the challenge courses.

  • The Winners: Models like UMA and GemNet-OC performed incredibly well. In many areas, they reached "chemical accuracy" (errors within roughly 1 kcal/mol of the reference, meaning they are almost as good as the slow, expensive physics engine, but millions of times faster). A usage sketch follows this list.
  • The Gap: While they are great at predicting stable molecules, they still struggle a bit with the most chaotic scenarios, like complex chemical reactions or long-range forces in batteries. This tells scientists exactly where to focus their next round of improvements.
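These trained models plug into standard simulation toolkits. Below is a hedged sketch of calling one through ASE; the fairchem entry points shown (`pretrained_mlip.get_predict_unit`, the "uma-s-1" model name, the "omol" task label) follow that package's published examples as best I can reconstruct them, so treat them as assumptions to verify against the current fairchem documentation:

```python
# Hedged sketch: using an OMol25-era UMA model as an ASE calculator.
# The fairchem entry points and model/task names below are assumptions
# based on the package's published examples; verify against its docs.
from ase.build import molecule
from fairchem.core import FAIRChemCalculator, pretrained_mlip

predictor = pretrained_mlip.get_predict_unit("uma-s-1", device="cpu")
calc = FAIRChemCalculator(predictor, task_name="omol")  # molecular task

atoms = molecule("H2O")
atoms.calc = calc
print("energy (eV):", atoms.get_potential_energy())
print("forces (eV/Å):", atoms.get_forces())
```

The payoff of the whole dataset shows up right here: the same two calls that would take a DFT code minutes to hours come back in roughly milliseconds.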

Why This Matters to You

This isn't just about fancy math. This dataset is the foundation for the next generation of technology:

  • Medicine: Designing new drugs with fewer side effects, by simulating how they interact with the body before ever testing on a human.
  • Energy: Creating better, safer, and longer-lasting batteries for your phone and electric car.
  • Materials: Discovering new materials that are stronger, lighter, or more conductive.

In short: The authors didn't just build a bigger dataset; they built a universal training ground. They gave the AI a "PhD" in chemistry by feeding it a diet of more than 100 million high-quality examples. Now, the rest of the world can use this data to build AI that helps us solve some of humanity's biggest problems, from curing cancer to saving the climate.