Original authors: Ibrahim Elsharkawy, Vinicius Mikuni, Wahid Bhimji, Benjamin Nachman

Published 2026-05-05

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have two very different worlds: one is the chaotic, high-speed world of particle physics (where scientists smash particles together to see what flies out), and the other is the intricate, sticky world of molecular chemistry (where atoms stick together to form medicines, materials, and life).

For a long time, scientists in these two fields used completely different tools to understand their worlds. In this paper, the authors introduce OmniMol, a tool that takes a "foundation model" already trained on particle physics data and teaches it chemistry.

Here is the simple breakdown of how they did it and what they found:

1. The "Master Chef" Analogy

Think of the original model, called Omnilearned, as a master chef who has spent years cooking with particle jets.

The Ingredients: In particle physics, a "jet" is a spray of subatomic particles (like protons and neutrons) flying out of a collision.
The Skill: This chef learned to recognize patterns in these sprays. They know how particles interact, how they cluster, and how to predict what happens next. They were trained on one billion different particle sprays.

Now, the authors asked: Can this same chef cook a molecular meal?

The New Ingredients: Instead of subatomic particles, the "ingredients" are atoms (like Carbon, Oxygen, Hydrogen) in a molecule.
The Challenge: Atoms behave differently than subatomic particles, but they share a similar structure: they are just points in space with specific types.

2. The "Universal Translator" (The Architecture)

To make this work, they didn't build a new chef from scratch. They took the existing "Master Chef" (Omnilearned) and kept its core toolkit, adapted for atoms:

The Point-Edge Transformer (PET): Imagine the chef looking at a plate of food. Instead of just looking at one ingredient at a time, this tool lets them look at every ingredient at once and see how every single one relates to every other one.
The "Physics Bias": This is the secret sauce. The model has a built-in "rulebook" that tells it, "Hey, these two particles/atoms are close together, so they should pay more attention to each other." This helps the model focus on the most important relationships without getting confused by the noise.

3. The Experiment: Fine-Tuning

The authors took this particle-trained model and gave it a "crash course" in chemistry using a dataset called oMol (a collection of millions of molecules).

The Goal: They wanted the model to act as a Machine-Learned Interatomic Potential (MLIP). In plain English, this means the model needs to predict two things for any group of atoms:
1. Energy: How much "glue" holds them together?
2. Force: If you push one atom, how hard will it push back?

4. The Results: Fast and Surprisingly Good

The paper found some exciting things:

The "Few-Shot" Superpower: Usually, teaching a computer chemistry requires massive amounts of data. But because OmniMol started with the "knowledge" of particle physics, it learned chemistry very quickly. Even with a relatively small amount of new data (around 100,000 molecules), the pre-trained model clearly beat the same model trained from scratch. It's like a master chef who can learn a new cuisine with just a few recipes because they already understand the basics of flavor and heat.
Speed: OmniMol is fast. On an A100 GPU, for systems of about 100 atoms, the medium-sized OmniMol runs roughly three times faster than comparable graph-neural-network models, so the same hour of computing time covers about three times as many evaluations.
The Trade-off: When they had huge amounts of data (millions of molecules), the advantage of starting with particle physics knowledge faded a bit. This suggests that the "particle physics knowledge" acts like a strong head-start, but if you have enough time and data to train a model from scratch, that head-start matters less.

5. The Big Picture

The paper presents OmniMol as the first demonstration that a "foundation model" built for one scientific discipline (particle physics) can be successfully transferred to a completely different one (chemistry).

They showed that a model that has learned how points in space interact in one field can be adapted to understand how points in space interact in another field, saving time and computing power.

In summary: The authors took a super-smart AI trained on high-energy particle crashes, tweaked its brain to understand atoms instead of particles, and found that it became a lightning-fast, highly accurate tool for predicting how molecules behave, especially when data is scarce.

Technical Summary: OmniMol

Problem Statement

Machine learning (ML) has transformed the representation and simulation of complex physical systems, particularly in particle physics and molecular chemistry. While these domains differ vastly in energy scales, they share a fundamental data structure: variably-sized sets of particles (or atoms) in phase space, effectively forming structured point clouds.
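The shared data structure described above can be illustrated with a minimal sketch. The `Point` class, its field names, the toy coordinates, and the `centroid` helper are all assumptions made for illustration; none of them come from the paper. The sketch shows that a molecule and a jet fit the same variable-length container and that a simple summary of the set does not depend on the order of its elements.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Point:
    """One element of a point cloud: continuous coordinates plus a discrete type."""
    coords: Tuple[float, float, float]
    kind: int  # hypothetical label: atomic number for atoms, particle ID for jets


def centroid(cloud: List[Point]) -> Tuple[float, float, float]:
    """Mean of the coordinates; the result does not depend on point order."""
    n = len(cloud)
    return tuple(sum(p.coords[axis] for p in cloud) / n for axis in range(3))


# Water-like toy molecule (illustrative coordinates, kinds are atomic numbers).
molecule = [
    Point((0.00, 0.00, 0.0), 8),
    Point((0.96, 0.00, 0.0), 1),
    Point((-0.24, 0.93, 0.0), 1),
]

# A toy "jet" with a different number of constituents uses the same container.
# 211 is the PDG code of a charged pion; the coordinates are made up.
jet = [Point((0.1 * i, 0.2 * i, 0.0), 211) for i in range(5)]

print(len(molecule), len(jet))
print(centroid(molecule))
```

Because both systems are just lists of typed points, a single architecture can in principle consume either one; only the input encoders need to know what the coordinates and type labels mean.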

The primary challenge addressed is the development of efficient Machine-Learned Interatomic Potentials (MLIPs). Traditional methods like Density Functional Theory (DFT) are computationally expensive, limiting large-scale and long-horizon molecular dynamics (MD) simulations. MLIPs aim to approximate potential energy surfaces and forces at a fraction of this cost. However, training robust MLIPs typically requires massive datasets and significant computational resources. The paper hypothesizes that a foundation model pre-trained on point clouds in particle physics (specifically particle jets) could be transferred to molecular dynamics, potentially accelerating optimization and improving accuracy in low-data regimes.

Methodology

Architecture: Point-Edge Transformer (PET)

OmniMol is built by adapting Omnilearned, a foundation model originally designed for classifying and generating particle jets in high-energy physics (HEP). The core architecture is a Point-Edge Transformer (PET), which couples local attention over $K$-nearest neighbors with global all-to-all transformer blocks.

Key architectural components include:

Input Embeddings: Atoms are embedded into a token space combining positional information ($\vec{r}$), discrete atomic numbers ($Z$), and additional features (charge, spin).
Local Attention Block: For each atom, a local neighborhood is constructed using $K$-nearest neighbors ($K=15$ for molecules, compared to $K=10$ for jets). Pairwise physical features are computed, including distance terms, inverse powers of distance, and learned functions of atomic embeddings. These are processed by a small local transformer to create a local embedding vector.
Global Attention with Interaction Bias: The global self-attention mechanism incorporates an explicit bias derived from pairwise physical features. The attention logits are modified as $A^*_{ij} = A_{ij} + B_{ij}$, where $B_{ij}$ is an MLP-embedded bias term. This "interaction-matrix attention bias" injects pairwise physics priors directly into the transformer, steering the network toward physically meaningful neighborhoods without sacrificing expressivity.
Output Heads: The generative head of Omnilearned is repurposed for two tasks:
- Force Prediction: A permutation-equivariant head predicting per-atom forces.
- Energy Prediction: A head predicting per-atom energy corrections, which are summed to yield the total molecular energy, preserving extensive priors.
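The biased attention logits $A^*_{ij} = A_{ij} + B_{ij}$ can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: the learned bias MLP is replaced by the hypothetical rule $B_{ij} = -\text{scale} \cdot d_{ij}$, and the content logits $A_{ij}$ are set to zero so that only the effect of the bias is visible. All function names are illustrative.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def pairwise_distances(positions: Sequence[Sequence[float]]) -> Matrix:
    """Euclidean distance d_ij between every pair of points."""
    return [[math.dist(p, q) for q in positions] for p in positions]


def distance_bias(dist: Matrix, scale: float = 1.0) -> Matrix:
    """ASSUMPTION: stand-in for the learned bias MLP, B_ij = -scale * d_ij,
    so closer pairs receive a larger (less negative) bias."""
    return [[-scale * d for d in row] for row in dist]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(logits: Matrix, bias: Matrix) -> Matrix:
    """Row-wise softmax of A*_ij = A_ij + B_ij."""
    return [
        softmax([a + b for a, b in zip(a_row, b_row)])
        for a_row, b_row in zip(logits, bias)
    ]


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
logits = [[0.0] * 3 for _ in range(3)]  # uniform content logits isolate the bias
bias = distance_bias(pairwise_distances(positions))
weights = biased_attention_weights(logits, bias)

# Point 0 attends more strongly to its near neighbor (1) than to the far one (2).
print(weights[0])
```

Because the bias is added before the softmax, it reweights attention rather than masking it: distant pairs can still interact if the content logits are large enough, which is one way to read the claim that expressivity is preserved.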

Invariance and Conservation Constraints

To satisfy physical constraints, the authors address two requirements:

Energy Conservation: Forces are not predicted directly but are computed via backpropagation of the energy output ($\vec{F}_i = -\nabla_{\vec{r}_i} E$). This makes the force field conservative by construction, so energy is conserved exactly, but it increases computational cost during training (requiring double backpropagation). Consequently, this constraint is applied only to the "small" model variant.
Rotational Equivariance: The standard architecture is not inherently equivariant because raw coordinate differences are fed into MLPs. To remedy this, the authors introduce an "equivariant and conservative" variant. This version removes direct coordinate difference terms from the pairwise features and instead incorporates angular information (cosines of angles formed by vectors between neighboring atoms) into the local block. This modification makes the model equivariant while recovering much of the performance lost by removing the coordinate terms.
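Both constraints can be checked on a toy system. The sketch below assumes a hypothetical harmonic pair energy (`toy_energy`) that depends only on interatomic distances, and it uses central finite differences as a stand-in for the backpropagation used in the paper. With the standard convention $\vec{F}_i = -\nabla_{\vec{r}_i} E$, the energy is unchanged by a rotation and the forces rotate along with the atoms.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, float, float]


def toy_energy(positions: Sequence[Sequence[float]], k: float = 1.0, r0: float = 1.0) -> float:
    """ASSUMPTION: harmonic pair potential, a function of distances only."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            energy += 0.5 * k * (math.dist(positions[i], positions[j]) - r0) ** 2
    return energy


def forces(energy_fn: Callable, positions: Sequence[Sequence[float]], h: float = 1e-5) -> List[Vec]:
    """F_i = -dE/dr_i via central finite differences (stand-in for autodiff)."""
    result = []
    for i in range(len(positions)):
        components = []
        for axis in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][axis] += h
            minus[i][axis] -= h
            components.append(-(energy_fn(plus) - energy_fn(minus)) / (2 * h))
        result.append(tuple(components))
    return result


def rotate_z(vectors: Sequence[Sequence[float]], angle: float) -> List[Vec]:
    """Rotate every 3-vector about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in vectors]


positions = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 0.8, 0.3)]
f = forces(toy_energy, positions)

rotated = rotate_z(positions, 0.7)
print(toy_energy(positions), toy_energy(rotated))  # energies agree (invariance)
print(forces(toy_energy, rotated)[0], rotate_z(f, 0.7)[0])  # forces agree (equivariance)
```

The same reasoning explains the design choice in the text: if the energy is built only from rotation-invariant inputs such as distances and angle cosines, forces obtained as its negative gradient are equivariant automatically.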

Training and Fine-Tuning Strategies

The model is fine-tuned on the oMol dataset (specifically oMol-25, oMol-4M, oMol-100M, and oMol-140M subsets). Two strategies are explored:

LoRA (Low-Rank Adaptation): The pre-trained PET backbone weights are frozen. Low-rank adapters are introduced only for the transformer body matrices ($W_Q, W_K, W_V, W_O, W_{MLP}$), alongside training of the molecular input encoders, the bias MLP, and the task heads. An "embedding adapting" layer is also added to modify learned embeddings.
Full Fine-Tuning: All weights in the body and input encoders are unfrozen and trained, while the task heads are trained from scratch.
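The difference between the two strategies can be made concrete with a minimal LoRA sketch. It follows the conventional LoRA formulation, $W_{\text{eff}} = W + \frac{\alpha}{r} B A$ with $B$ initialized to zero; whether the paper uses exactly this scaling and initialization is an assumption, and the matrix sizes, rank, and `alpha` below are arbitrary toy values.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """W_eff = W + (alpha / r) * B @ A. W stays frozen; only A and B are trained."""
    rank = len(a)  # A has shape (r, d_in), B has shape (d_out, r)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


d_out, d_in, rank = 4, 4, 1
w = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen
a = [[0.1] * d_in for _ in range(rank)]   # trainable, shape (r, d_in)
b = [[0.0] * rank for _ in range(d_out)]  # trainable, zero-initialized

# With B = 0 the adapted layer starts out identical to the pre-trained one.
w_eff = lora_weight(w, a, b, alpha=2.0)

full_params = d_out * d_in           # trained under full fine-tuning
lora_params = rank * (d_in + d_out)  # trained under LoRA
print(full_params, lora_params)
```

In this toy case the saving is small, but the adapter parameter count grows linearly with the layer width while the full matrix grows quadratically, which is why LoRA is attractive for large backbones.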

The training objective minimizes the sum of Mean Absolute Errors (MAE) for energies and forces, with forces weighted more heavily ($\lambda_F = 10$).
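A minimal sketch of such an objective follows. It assumes the energy term has unit weight and that both errors are plain averages (over molecules for energies, over flattened force components for forces); the paper's exact normalization is not stated here, so treat `mlip_loss` and its arguments as illustrative.

```python
from typing import Sequence


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mlip_loss(
    e_pred: Sequence[float],
    e_true: Sequence[float],
    f_pred: Sequence[float],
    f_true: Sequence[float],
    lambda_f: float = 10.0,
) -> float:
    """ASSUMPTION: loss = MAE(energies) + lambda_f * MAE(force components)."""
    return mae(e_pred, e_true) + lambda_f * mae(f_pred, f_true)


# One molecule with an energy error of 0.5 and two force components off by 0.1.
loss = mlip_loss([1.0], [1.5], [0.0, 0.0], [0.1, -0.1])
print(loss)
```

In this example the energy contributes 0.5 and the forces contribute 10 × 0.1 = 1.0, so a force error five times smaller than the energy error still dominates the loss.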

Key Results

Performance on oMol

Full Training: When trained on large datasets (oMol-4M and oMol-100M/140M), OmniMol is competitive with state-of-the-art MLIPs. For instance, on oMol-140M, the OmniMol-large model achieves an energy MAE of 1.04 meV/atom and force MAE of 13.59 meV/Å.
Low-Data Regime: The most significant gains are observed when training data is limited. When fine-tuned on only 100k molecules or with very few epochs (2 passes) over oMol-4M, the pre-trained OmniMol variants significantly outperform models trained from scratch.
- On a 100k subset, pre-training improved energy MAE by up to 29.4% and force MAE by 26.9% for the medium model.
- With only two epochs on oMol-4M, the medium model showed a 54.6% improvement in energy MAE and 56.9% in force MAE compared to its non-pre-trained counterpart.
Equivariant/Conservative Variant: The equivariant and conservative model variant shows significantly improved performance (especially for forces) in low-data regimes, though this advantage diminishes as the dataset size increases.

Scaling and Inference Speed

Scaling: OmniMol follows clean power-law scaling with model size, showing no signs of saturation up to 1 billion parameters, consistent with recent findings on transformer-based MLIPs.
Inference Speed: Despite large parameter counts, OmniMol demonstrates uniquely fast inference speeds due to hardware optimizations for transformers. On an A100 GPU for systems of ~100 atoms, OmniMol-medium is approximately 3x faster than comparable Graph Neural Network (GNN) baselines (eSEN-md-d and AllScAIP-md) while maintaining competitive accuracy (only ~0.7 meV/atom higher energy error than AllScAIP-md).
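A power-law scaling claim of this kind is usually checked by fitting a straight line in log-log space. The sketch below shows that procedure on synthetic numbers only: the constant 50.0 and exponent 0.2 are made up to generate the data and are not results from the paper, and `fit_power_law` is a hypothetical helper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], errors: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of error = c * size**(-alpha) in log-log space.

    Returns (c, alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# SYNTHETIC data generated from an exact power law, for illustration only.
sizes = [1e6, 1e7, 1e8, 1e9]
errors = [50.0 * n ** -0.2 for n in sizes]

c, alpha = fit_power_law(sizes, errors)
print(c, alpha)
```

On real measurements, saturation would show up as points bending above the fitted line at the largest model sizes; "no signs of saturation" means the largest models still fall on the line.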

Significance and Claims

The paper claims to present the first demonstration of cross-discipline transfer for scientific point-cloud foundation models. By adapting a model pre-trained on high-energy physics particle jets to molecular dynamics, the authors demonstrate that:

Cross-Domain Transfer is Viable: A foundation model built for particle physics can effectively transfer to molecular chemistry, suggesting that the underlying point-cloud structures share learnable features across vastly different physical scales.
Inductive Bias Accelerates Learning: The pre-training acts as a strong inductive bias. Similar to how equivariance helps when data is scarce, the "bitter lesson" of pre-training allows for rapid optimization and improved accuracy when training data is limited.
Efficiency: The architectural transfer enables uniquely fast inference speeds, which is critical for applications requiring rapid exploration of design spaces, such as small molecule drug discovery.

The authors conclude that while the study focuses on MLIPs, the lessons regarding point-cloud foundation models may have widespread utility across scientific domains where systems are described as unordered sets of interacting bodies. They do not claim universal superiority over all existing methods in all regimes but highlight the specific advantages in low-data scenarios and inference speed.

OmniMol: Transferring Particle Physics Knowledge to Molecular Dynamics with Point-Edge Transformers