Assessment of scoring functions for computational… — Plain-Language Explanation

Original authors: Jacob Sumner, Grace Meng, Naomi Brandt, Alex T. Grigas, Andrés Córdoba, Mark D. Shattuck, Corey S. O'Hern

Published 2026-06-12

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to solve a 3D puzzle where two specific pieces (proteins) must snap together perfectly to form a working machine. In the real world, scientists can sometimes capture a picture of these pieces already snapped together using experimental techniques such as X-ray crystallography. But often, they only have the two pieces separately and need to use a computer to figure out exactly how they fit.

This paper is like a report card for the "guessing algorithms" scientists use to solve this puzzle. The researchers asked: How good are these computer programs at picking the correct way the pieces fit together out of hundreds of thousands of wrong guesses?

Here is a breakdown of their findings using simple analogies:

1. The Problem: The "Needle in a Haystack"

When a computer tries to fit two proteins together, it generates a huge number of possible positions (540,000 per protein pair in this study). Most of these are wrong (like trying to fit a square peg in a round hole). A few are close to the right answer, and one is the perfect "native" fit.

The computer uses a "scoring function" to rank these guesses. Think of the scoring function as a judge that gives each guess a grade. The goal is for the judge to give the highest grade to the perfect fit and low grades to the bad ones.

2. The Old Way vs. The New Way (The Sampling Issue)

Previously, scientists checked if these judges were good by looking at the "Hit Rate." This is like asking: "Did the judge put the correct answer in the top 5 guesses?"

The authors found a major flaw in this method: the result depends on the mix of guesses the judge is shown, and that mix is usually dominated by bad ones. It's like judging a talent show where 99% of the acts are terrible. A judge who can merely tell "terrible" from "not terrible" looks brilliant, even if that judge cannot rank the decent acts against each other or pick out the actual star.

The Fix: The researchers created a new method where they forced the computer to generate guesses that were evenly spread out from "terrible" to "perfect."
The Result: When they looked at the data this way, they realized many judges were actually much worse than previously thought. For about half the puzzles, the judges' grades tracked the true quality of a guess only moderately well (a correlation below 0.7).
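
To make the "Hit Rate" question concrete, here is a minimal Python sketch of the top-N check for a single puzzle. It is an illustration, not the authors' code: the function name, the toy numbers, and the assumption that a higher score means a better grade are all hypothetical. The cutoff of 0.23 is the DockQ value the technical summary uses to label a guess as "positive".

```python
def top_n_hit(scores, dockq, n=5, threshold=0.23, higher_is_better=True):
    """Return True if any of the n best-scored models has DockQ >= threshold."""
    # Sort model indices from best grade to worst grade.
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    return any(dockq[i] >= threshold for i in order[:n])


# Hypothetical toy data: six guesses for one puzzle.
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.05]    # the judge's grades
dockq = [0.05, 0.10, 0.02, 0.80, 0.30, 0.01]  # true closeness to the native fit

print(top_n_hit(scores, dockq, n=3))  # False: the top three are all poor guesses
print(top_n_hit(scores, dockq, n=5))  # True: the good guess sneaks in at rank 4
```

Note that the answer is a single yes or no per puzzle. It says nothing about whether the grades follow quality across the whole range of guesses, which is the gap the authors' evenly spread test set is meant to expose.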

3. The "Shape" of the Puzzle

The researchers discovered that some puzzles are just naturally harder to grade than others. They looked at the "landscape" of the puzzle:

Easy Puzzles: Imagine a smooth, deep bowl. If you roll a ball (the protein) anywhere, it naturally rolls to the bottom (the correct spot). The computer can easily tell which way is "down."
Hard Puzzles: Imagine a bumpy, flat plateau with tiny dips everywhere. It's hard to tell which dip is the real bottom. The computer gets confused because the "wrong" spots look almost as good as the "right" spot.

They found that puzzles where the two pieces are tightly intertwined (like two hands clasping) are easier to score. Puzzles where the pieces just touch on a flat surface are harder.

4. A Simpler Judge

The paper tested seven different high-tech "judges" (some based on physics, some on statistics, and some using advanced AI).

The Surprise: The most complex AI judges didn't always win.
The Solution: The authors built a brand new, very simple "judge" based on just two physical rules:
1. How many atoms are touching between the two pieces?
2. How "interlocked" are the shapes?
The Result: This simple judge performed at least as well as the most complex, high-tech judges currently in use. It suggests that sometimes, understanding the basic physics matters more than using a massive, complicated algorithm.
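
The first of the two rules, counting touching atoms, can be sketched in a few lines of Python. This is only an illustration: the technical summary says contacts are based on heavy-atom distances but does not give the cutoff here, so the 4.5 Å value, the function name, and the toy coordinates are assumptions.

```python
import math


def count_interface_contacts(atoms_a, atoms_b, cutoff=4.5):
    """Count atom pairs (one atom from each piece) closer than `cutoff` angstroms.

    atoms_a, atoms_b: lists of (x, y, z) heavy-atom coordinates.
    The 4.5 angstrom default is an assumed value, not taken from the paper.
    """
    count = 0
    for a in atoms_a:
        for b in atoms_b:
            if math.dist(a, b) < cutoff:
                count += 1
    return count


# Hypothetical toy coordinates for two tiny "pieces".
atoms_a = [(0, 0, 0), (1, 0, 0), (10, 0, 0)]
atoms_b = [(0, 3, 0), (0, 0, 20)]

print(count_interface_contacts(atoms_a, atoms_b))  # 2
```

The double loop is fine for a toy example; a real protein interface has thousands of atoms, where a spatial grid or k-d tree would be the usual choice.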

5. The "Wobbly" Pieces (Flexible Docking)

So far, we assumed the puzzle pieces are rigid (like plastic blocks). But in real life, proteins are like rubber bands; they wiggle and change shape when they come together.

The researchers tested what happens when the pieces are slightly deformed (stretched or bent) before they try to fit.
The Finding: As the pieces get more "wobbly" (further from their perfect shape), the judges get terrible at their job. The correlation between the score and the correct answer drops sharply. It's like trying to grade a puzzle where the pieces keep changing shape while you are looking at them.

Summary

This paper tells us that:

- We need to stop using old methods to test if our protein-fitting software is working; we need to test it on a fair, balanced set of guesses.
- Some protein pairs are just harder to predict than others, depending on how "bumpy" or "flat" their meeting spot is.
- You don't always need a super-complex AI to solve this; a simple model based on how many atoms touch and how interlocked the shapes are works just as well as the current state-of-the-art tools.
- If the proteins change shape (flex), our current tools struggle significantly, highlighting a major area for future improvement.

Technical Summary: Assessment of Scoring Functions for Computational Models of Protein-Protein Interfaces

Problem Statement
The primary goal of computational studies on protein-protein interfaces (PPIs) is to predict the binding site between two monomers forming a heterodimer. While rigid-body re-docking (reconstructing the bound complex from known bound conformations) is often considered a "solved" problem with high hit rates, the accuracy of scoring functions in distinguishing near-native models from low-quality decoys remains inconsistent. Previous assessments relied heavily on rank-based metrics (e.g., "hit rate") or classification metrics (e.g., Area Under the ROC Curve, AUC). However, these methods are highly sensitive to the sampling distribution of computational models. In standard datasets, near-native models are vastly outnumbered by low-quality models, leading to inflated classification scores (high AUC) that do not reflect the actual monotonic correlation between the scoring function and structural accuracy (DockQ). Furthermore, existing studies often fail to evaluate scoring performance on a per-target basis or account for the conformational flexibility of monomers.
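
The contrast between a classification metric (AUC) and a monotonic correlation (Spearman $\rho$ ) can be illustrated with a small, self-contained Python sketch. The data below are hypothetical and deliberately extreme, the helper names are illustrative, and the convention that a higher score means a better model is an assumption made only for this example.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def auc(scores, is_positive):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical, heavily imbalanced target: nine poor models, one near-native.
dockq = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.90]
scores = [9, 8, 7, 6, 5, 4, 3, 2, 1, 10]  # higher = judged better (assumed)
labels = [q >= 0.23 for q in dockq]       # "positive" cutoff from the summary

print(auc(scores, labels))                # 1.0
print(round(spearman(scores, dockq), 3))  # -0.455
```

In this toy set the single near-native model receives the top score, so the AUC is perfect, yet among the nine poor models the score runs opposite to DockQ and the overall rank correlation is negative. Real datasets are far less contrived, but the example shows why the two kinds of metric can disagree when low-quality models dominate.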

Methodology
The authors developed a rigorous framework to assess seven state-of-the-art PPI scoring functions (VoroMQA, ITScorePP, PyDock, Rosetta, ZRank2, Deeprank-GNN-ESM, and GNN-DOVE) and a new Support Vector Regression (SVR) model based on physical features.

Dataset Construction:
- Primary Dataset: 84 high-resolution heterodimer X-ray crystal structures from the Protein Data Bank (PDB), filtered for resolution ( $\le$ 3.5 Å), absence of non-protein polymers, and low pairwise sequence similarity ( $<$ 20%).
- Validation Dataset: 62 non-redundant targets from the ZDOCK Benchmark 5.5.
- Flexible Docking Subset: 15 heterodimers with available unbound monomer structures to study conformational changes.
Uniform Sampling Strategy:
- Instead of relying on standard random sampling (which yields a bias toward low-quality models), the authors generated 540,000 rigid-body re-docked models per target using ZDOCK 3.0.2.
- These models were subsampled to create a dataset where models are uniformly distributed across the full range of the DockQ metric (a continuous measure of structural similarity to the native structure, $0 \le$ DockQ $\le$ 1).
- For each target, approximately 1,000 models were selected, balanced to include equal numbers of "positive" (DockQ $\ge$ 0.23) and "negative" models.
Evaluation Metrics:
- Spearman Correlation ( $\rho$ ): Calculated between the scoring function output and the DockQ score for each target. This measures the monotonic relationship between the predicted score and structural accuracy.
- Hit Rate & AUC: Calculated to demonstrate their sensitivity to sampling biases.
- Physical Feature Analysis: The authors quantified interface separability ( $S$ ) using a polynomial SVM and the number of interface contacts ( $N_c$ ) based on heavy atom distances.
Flexible Docking Extension:
- To address flexible-body docking, the authors generated intermediate monomer conformations by linearly interpolating between unbound and bound forms. They performed rigid-body docking on these deformed conformations to assess how scoring performance degrades as the Root-Mean-Square Deviation (RMSD) from the bound state increases.
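
The uniform sampling step described above can be sketched as a simple binning procedure. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not specify how the subsampling was done, so the equal-width bins, the bin count, the per-bin quota, and the function name are all hypothetical (10 bins of 100 would give roughly the 1,000 models per target mentioned above).

```python
import random


def uniform_subsample(dockq, n_bins=10, per_bin=100, seed=0):
    """Return indices of models, keeping at most `per_bin` per equal-width DockQ bin.

    dockq: list of DockQ values in [0, 1], one per docked model.
    Bins with fewer than `per_bin` models are kept whole.
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for i, q in enumerate(dockq):
        b = min(int(q * n_bins), n_bins - 1)  # DockQ == 1.0 falls in the last bin
        bins[b].append(i)
    chosen = []
    for members in bins:
        if len(members) > per_bin:
            members = rng.sample(members, per_bin)
        chosen.extend(members)
    return chosen


# Hypothetical skewed pool: 1,000 poor models and only 5 in each better bin.
pool = [i / 10000 for i in range(1000)]
pool += [b / 10 + 0.05 for b in range(1, 10) for _ in range(5)]

picked = uniform_subsample(pool, n_bins=10, per_bin=5)
print(len(pool), len(picked))  # 1045 50
```

After subsampling, each DockQ bin contributes the same number of models, so a metric computed on `picked` is no longer dominated by the low-quality end of the pool. In practice the sparsest bin limits the quota, which is presumably why a very large initial pool is generated per target.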

Key Contributions and Results

Re-evaluation of Scoring Metrics:
- The study demonstrates that hit rates and AUC values are unreliable when model quality is not uniformly sampled. For example, a dataset can yield an AUC of 0.959 while the Spearman correlation ( $\rho$ ) is only $\approx$ 0.256.
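- When models are uniformly sampled over DockQ, a linear relationship emerges: $AUC \approx -0.5\rho + 0.5$ . Under this relation, $\rho = 0$ corresponds to $AUC \approx 0.5$ (chance level) and $\rho = -1$ to $AUC \approx 1$ . This allows for consistent, target-specific evaluation of scoring functions using $\rho$ .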
Performance of Current Scoring Functions:
- Under uniform sampling, current scoring functions perform poorly for a significant portion of targets. Approximately 50% of targets exhibit $|\rho| < 0.70$ .
- Only about 25% of targets are classified as "easy" ( $|\rho| > 0.8$ ), while 23 targets are "hard" ( $|\rho| < 0.6$ ).
- Among the tested functions, ZRank2 showed the highest average correlation ( $|\langle\rho\rangle| \approx 0.78$ ), while VoroMQA performed best only on easy targets ( $|\langle\rho\rangle| \approx 0.56$ ). Machine learning-based GNN methods (Deeprank-GNN-ESM, GNN-DOVE) showed mixed results, with GNN-DOVE failing to consistently score high-DockQ models.
Physical Determinants of Scoring Difficulty:
- The authors identified two key physical features that correlate with scoring difficulty:
  - Interface Separability ( $S$ ): A measure of geometric complementarity. Lower $S$ (more intertwined interfaces) correlates with higher $|\rho|$ (easier scoring).
  - Number of Interface Contacts ( $N_c$ ): Higher $N_c$ correlates with higher $|\rho|$ .
- Targets with highly intertwined monomers and many interface contacts are easier to score. Conversely, targets with flat, separable interfaces and fewer contacts are difficult.
- The "DockQ landscape" (the distribution of structural similarity in configuration space) is more isotropic for easy targets, whereas difficult targets exhibit anisotropic landscapes in which the region favored by the scoring function fails to overlap with the native basin.
Development of a Simplified Scoring Model:
- Using only the two physical features ( $S$ and $N_c$ ), the authors constructed a simple SVR model. This model matched or exceeded the performance of the most accurate existing scoring functions (like ZRank2), suggesting that complex energy functions may not be necessary if key physical descriptors are correctly weighted.
Impact of Flexibility:
- In the flexible docking experiments, the Spearman correlation between PPI scores and DockQ decreased strongly as the monomers were deformed from their bound conformations (increasing iRMSD). This highlights that current scoring functions struggle significantly when the input monomers are not in their native bound conformations.
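
The deformation procedure from the Methodology (linear interpolation between unbound and bound monomer forms) can be sketched as follows. This is a simplified illustration that assumes the two structures are already superimposed and have matching atoms in the same order; the function names and toy coordinates are hypothetical, and the RMSD here is computed over whatever atoms are passed in.

```python
import math


def interpolate(unbound, bound, t):
    """Linearly interpolate matched atom coordinates.

    t = 0 returns the unbound form, t = 1 the bound form.
    """
    return [
        tuple(u + t * (b - u) for u, b in zip(atom_u, atom_b))
        for atom_u, atom_b in zip(unbound, bound)
    ]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Hypothetical two-atom "monomer" in its unbound and bound forms.
unbound = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
bound = [(0.0, 0.0, 0.0), (4.0, 3.0, 0.0)]

for t in (0.0, 0.5, 1.0):
    intermediate = interpolate(unbound, bound, t)
    print(t, round(rmsd(intermediate, bound), 3))
```

With straight-line interpolation the deviation from the bound form shrinks in proportion to (1 - t), which gives a controlled series of increasingly deformed inputs. Restricting the atom lists to interface residues would give an interface-only deviation instead.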

Significance and Claims
The paper claims that the field of PPI scoring has been misled by metrics (hit rate, AUC) that are sensitive to sampling biases. By implementing uniform sampling over DockQ, the authors reveal that current scoring functions are far less accurate than previously thought, particularly for targets with specific interface geometries (low contact count, high separability).

The significance of this work lies in:

- Establishing a more reliable, correlation-based metric ( $\rho$ ) for evaluating scoring functions on a per-target basis.
- Identifying that the difficulty of scoring is physically rooted in interface geometry (separability and contact density) rather than just the complexity of the scoring function.
- Demonstrating that a simple model based on two physical features can rival complex, state-of-the-art functions, suggesting a path toward more interpretable and effective scoring functions.
- Highlighting the critical challenge of flexible docking, where scoring accuracy degrades rapidly as monomer conformations deviate from the bound state, indicating a need for methods that incorporate more discriminating physical features or better handle conformational ensembles.

The authors conclude that improving PPI docking predictions requires focusing on the correlation between scores and structural similarity (DockQ) and integrating specific, discriminating physical features into scoring functions, rather than relying solely on current benchmarking practices.
