Physics-Grounded Evaluation to Guide Accurate Biomolecular Prediction

This paper presents a physics-grounded evaluation showing that state-of-the-art protein structure prediction models capture basic energetic principles but exhibit pervasive biases in atomic interactions and conformational preferences. These biases limit the models' accuracy and generalizability, highlighting the need for physics-grounded frameworks to guide the development of next-generation models capable of reliable biomolecular function prediction.

Lyu, N., Du, S., Shao, Q., Yang, Z., Ma, J., Herschlag, D.

Published 2026-03-25

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: The "Perfect" Protein Painter

Imagine you have a master painter who can look at a list of ingredients (a protein's genetic code) and instantly paint a perfect 3D sculpture of a protein. For a few years, this painter (called AlphaFold) has been hailed as a miracle. They can predict the shape of almost any protein in the human body with stunning accuracy.

Scientists are now excited because they think, "If we know the shape, we can figure out how the protein works." They want to use these paintings to design new drugs, fix broken enzymes, and cure diseases.

But this new paper asks a scary question:

"Just because the painting looks right from a distance, does every single brushstroke actually make sense physically?"

The authors, a team of physicists and biologists, decided to stop looking at the "big picture" and start inspecting the tiny details. They built a new kind of magnifying glass to check if the painter actually understands the laws of physics or if they are just memorizing patterns.


The Problem: Measuring the Wrong Thing

The Old Way (The Tape Measure):
Previously, scientists checked the painter's work by measuring the distance between atoms. It's like checking a sculpture by measuring the distance from the tip of the nose to the tip of the ear. If the distance is close to the real thing, the sculpture gets a high score.

  • The Flaw: You can have the right distance between the nose and ear, but if the ear is made of jelly and the nose is made of lead, the sculpture won't work in real life. The old method didn't care about how the atoms were holding hands, only where they were standing.

The New Way (The Physics Inspector):
This paper introduces a new evaluation method. Instead of just measuring distances, they check the energetic rules.

  • The Analogy: Imagine a dance. The old method just checked if the dancers were standing in the right spots on the floor. The new method checks if they are actually holding hands correctly, if their knees are bent at a natural angle, and if they aren't trying to dance through each other's bodies (which would be physically impossible).
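To make the contrast concrete, here is a toy sketch (not the paper's actual metric) of the difference between a distance-based score and a physics check. The coordinates and the 1.2 Å clash cutoff are invented for illustration:

```python
import numpy as np

# Toy "structures": each row is an atom's 3D position (invented coordinates).
reference = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]])
predicted = np.array([[0.0, 0.0, 0.0], [0.8, 0.0, 0.0], [3.0, 0.0, 0.0]])

def rmsd(a, b):
    """The 'tape measure': average positional error, blind to physics."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def has_clash(coords, min_dist=1.2):
    """A crude 'physics inspector': flag atom pairs impossibly close together."""
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if np.linalg.norm(coords[i] - coords[j]) < min_dist:
                return True
    return False

print(round(rmsd(reference, predicted), 2))            # 0.4 -- looks "close"
print(has_clash(reference), has_clash(predicted))      # False True
```

The predicted structure scores well on raw distance, yet it contains a 0.8 Å atom-atom contact that real physics forbids and that the tape measure never notices.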

What They Found: The Painter is Good, But Flawed

The team tested the top three "painters" (AlphaFold2, AlphaFold3, and ESMFold) against real protein structures determined experimentally and deposited in the Protein Data Bank (PDB). They looked at 3.4 million tiny interactions.

Here is what they discovered:

1. The Painter Knows the Basics

The models are great at the "skeleton." They know where the backbone of the protein goes and can predict the general shape very well. They understand that atoms generally want to be close to each other but not too close.

2. The "Side-Chain" Mistakes (The Fingers and Toes)

Proteins have a backbone and "side chains" (like fingers and toes sticking out). These side chains are what actually grab onto other molecules to do work.

  • The Finding: The models are getting the location of the fingers right, but they are often twisting the fingers the wrong way.
  • The Analogy: Imagine a hand. The model puts the hand in the right spot, but it twists the thumb so it's pointing backward, or bends the pinky finger into a painful, unnatural position.
  • The Stats:
    • AlphaFold (2 & 3): About 30% of the side-chain interactions are "twisted" incorrectly.
    • ESMFold: About 60% are wrong.
    • The Consequence: If you try to use these models to design a drug that fits into a protein's "hand," the drug might not fit because the fingers are bent the wrong way.
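The "twist" of a side chain is measured as a torsion (dihedral) angle between four consecutive atoms. Below is a minimal sketch of that geometry, using invented coordinates; two structures can place all four atoms at similar distances while the torsion flips by 180°:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four atoms (e.g. a side-chain chi angle)."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project the flanking bonds onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

# Invented coordinates: four atoms along a chain.
p0 = np.array([1.0, 0.0, 0.0])
p1 = np.array([0.0, 0.0, 0.0])
p2 = np.array([0.0, 1.0, 0.0])
p3 = np.array([1.0, 1.0, 0.0])           # same-side arrangement: 0 degrees
p3_twisted = np.array([-1.0, 1.0, 0.0])  # flipped arrangement: 180 degrees

print(dihedral(p0, p1, p2, p3))          # 0.0
print(dihedral(p0, p1, p2, p3_twisted))  # 180.0
```

A model can get every pairwise distance roughly right and still report the "twisted" value, which is the class of error the paper counts for side chains.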

3. The "Hallucinations"

Sometimes, the models invent interactions that don't exist in reality.

  • The Analogy: It's like the painter deciding, "I think this protein needs a hydrogen bond here," and drawing a connection between two atoms that are actually too far apart to ever touch. They are "hallucinating" connections that physics says are impossible.
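A crude geometric screen makes the "too far apart to ever touch" point concrete. The sketch below uses a typical textbook heavy-atom cutoff (~3.5 Å) and invented coordinates; real hydrogen-bond criteria also check angles and atom chemistry:

```python
import numpy as np

DONOR_ACCEPTOR_MAX = 3.5  # Angstroms: beyond this, no plausible hydrogen bond

def plausible_hbond(donor_xyz, acceptor_xyz, max_dist=DONOR_ACCEPTOR_MAX):
    """Distance-only screen: is this donor-acceptor pair close enough to bond?"""
    return float(np.linalg.norm(donor_xyz - acceptor_xyz)) <= max_dist

# One pair at 2.9 A (plausible) and one at 5.0 A (a "hallucinated" bond --
# the atoms are simply too far apart to interact).
print(plausible_hbond(np.array([0.0, 0, 0]), np.array([2.9, 0, 0])))  # True
print(plausible_hbond(np.array([0.0, 0, 0]), np.array([5.0, 0, 0])))  # False
```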

4. The "Frozen" Ensemble

Proteins aren't statues; they wiggle and dance. They exist as a cloud of many possible shapes (an ensemble).

  • The Finding: The models tend to predict just one rigid shape. They are like a photographer taking a single, frozen snapshot, whereas the real protein is a video of a dancer moving.
  • The Issue: If a protein needs to wiggle to catch a virus, a model that predicts it as a stiff statue won't help us understand how it works.
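The "video versus snapshot" idea has a standard quantitative form: the per-atom root-mean-square fluctuation (RMSF) across an ensemble. This toy sketch fakes a five-frame ensemble with invented coordinates, where one atom wiggles and the others stay put; a single static prediction is like keeping only one frame and losing the fluctuation entirely:

```python
import numpy as np

rng = np.random.default_rng(0)
base = np.array([[0.0, 0, 0], [1.5, 0, 0], [3.0, 0, 0]])  # 3-atom toy chain
wiggle = np.array([0.01, 0.01, 0.5])  # atom 2 moves a lot; atoms 0, 1 barely move
snapshots = base + rng.normal(scale=wiggle[:, None], size=(5, 3, 3))

# RMSF: per-atom root-mean-square fluctuation around the ensemble mean.
mean_pos = snapshots.mean(axis=0)
rmsf = np.sqrt(((snapshots - mean_pos) ** 2).sum(axis=2).mean(axis=0))
print(rmsf.round(3))  # atom 2's value is far larger than atoms 0 and 1
```

A rigid single-structure prediction implicitly asserts RMSF ≈ 0 everywhere, which is exactly what the paper flags as missing.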

The "Relaxation" Fix (And Why It's Not Enough)

The authors tried a trick: they took the models' predictions and ran them through a "physics simulator" (called force field relaxation) to see if the atoms would naturally settle into a better position.

  • The Result: It helped a little! It fixed some of the twisted fingers.
  • The Catch: It didn't fix everything. About 20% of the errors remained, and sometimes the simulator created new fake connections. It's like trying to fix a crooked picture frame by shaking it; it might straighten a bit, but the frame is still fundamentally warped.
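Why relaxation polishes but cannot fully repair can be seen in a one-line cartoon of energy minimization (not the paper's actual force field): gradient descent slides a structure into the nearest energy minimum, so a prediction that starts in the wrong basin stays wrong. The double-well energy and step sizes below are invented:

```python
def energy(x):
    """Toy 1D double-well energy with minima at x = -1 and x = +1."""
    return (x**2 - 1.0) ** 2

def grad(x):
    return 4.0 * x * (x**2 - 1.0)

def relax(x, lr=0.01, steps=2000):
    """Gradient-descent 'relaxation': settle into the nearest minimum."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(round(relax(0.9), 3))   # 1.0  -- started near the right answer, polished
print(round(relax(-0.6), 3))  # -1.0 -- started in the wrong basin, stuck there
```

Relaxation fixes small local distortions (the slightly-bent finger) but cannot carry a structure over an energy barrier to a qualitatively different, correct conformation.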

The "Common Enemy"

Interestingly, even though AlphaFold2 and AlphaFold3 use very different computer architectures (like two different artists using different brushes), they made almost the exact same mistakes.

  • The Takeaway: This suggests the problem isn't just the software code; it's that the models haven't truly learned the deep, underlying laws of physics. They are still mostly "guessing" based on patterns they've seen before, rather than understanding why atoms behave the way they do.

Why Does This Matter?

If you are a doctor or a drug designer, you might be tempted to say, "Well, the shape looks 90% right, that's good enough!"

This paper says: No, it's not.

  • The Analogy: If you are building a bridge, being 90% right about the shape of the steel beams is useless if the bolts are twisted the wrong way. The whole bridge could collapse.
  • The Future: To truly predict how proteins work (to cure diseases, design new materials), the next generation of AI models needs to stop just memorizing shapes and start learning the rules of physics. They need to understand energy, probability, and how atoms actually push and pull on each other.

Summary in One Sentence

This paper reveals that while AI models are amazing at drawing the "outline" of proteins, they are still making frequent, physics-breaking mistakes with the tiny details, which could lead to failures when trying to use these models for real-world medicine and drug discovery.
