Structural Plausibility Without Binding Specificity: Limits of AI-Based Antibody-Antigen Structure Prediction Confidence Scores

This study demonstrates that while state-of-the-art AI methods can generate geometrically plausible antibody-antigen structures, their internal confidence scores fail to reliably distinguish correct binding pairs from incorrect ones, highlighting a critical need for explicit negative controls and realistic benchmarking in therapeutic discovery.

Original authors: Smorodina, E., Ali, M., Kropivsek, K., Salicari, L., Miklavc, S., Kappassov, A., Fu, C., Sormanni, P., de Marco, A., Greiff, V.

Published 2026-03-03

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to find a specific key that fits a specific lock. You have a bag of 106 real keys and 106 real locks. You know exactly which key opens which lock (the "Real" pairs). But to test your detective skills, you also mix them up randomly, creating thousands of fake pairings where a key is held up to a lock it doesn't belong to (the "Shuffled" pairs).

Your goal is to use a high-tech AI scanner to look at these pairs and say, "Yes, this key fits this lock!" or "No, this is a mismatch."

This paper is about testing three of the most advanced AI scanners available today (AlphaFold3, Boltz-2, and Chai-1) to see if they can actually tell the difference between a real match and a fake one.
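
To make the setup concrete, here is a minimal Python sketch of how one might build the "Shuffled" negative controls: every antibody is deliberately paired with antigens it is not known to bind, so any high confidence the model assigns to those pairs is a false signal. The names, counts, and helper function below are illustrative assumptions, not code from the paper.

```python
import random

# Hypothetical list of known antibody-antigen partners; names are illustrative only.
real_pairs = [
    ("antibody_001", "antigen_001"),
    ("antibody_002", "antigen_002"),
    ("antibody_003", "antigen_003"),
    # ... one entry per experimentally confirmed complex
]

def make_shuffled_pairs(pairs, n_decoys_per_antibody=3, seed=0):
    """Create mismatched antibody-antigen pairs to use as negative controls."""
    rng = random.Random(seed)
    antigens = [ag for _, ag in pairs]
    true_partner = dict(pairs)
    decoys = []
    for ab, _ in pairs:
        # Only antigens this antibody is NOT known to bind are eligible decoys.
        wrong = [ag for ag in antigens if ag != true_partner[ab]]
        for ag in rng.sample(wrong, min(n_decoys_per_antibody, len(wrong))):
            decoys.append((ab, ag))
    return decoys

shuffled_pairs = make_shuffled_pairs(real_pairs)
# Both real_pairs and shuffled_pairs would then be folded by AlphaFold3 / Boltz-2 / Chai-1,
# and the confidence scores of the two groups compared.
```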

The Big Surprise: The AI is "Polite" but Wrong

The authors found that these AI tools are incredibly good at making things look like they fit.

  • The Analogy: Imagine you have a square peg and a round hole. The AI is so good at geometry that it can twist the square peg into a shape that looks like it fits the round hole perfectly. It creates a "plausible" structure.
  • The Problem: Just because the peg looks like it fits doesn't mean it actually turns the lock. The AI generates a beautiful, geometrically sound structure for the fake pairs just as often as it does for the real pairs.

The "Confidence Score" Trap

When you ask these AI tools, "How sure are you that this is a match?" they give you a confidence score (like a grade from 0 to 1).

  • The Reality: The paper shows that these scores are not reliable for telling real matches from fake ones (a simple way to test this yourself is sketched after this list).
  • The Metaphor: It's like a weatherman who is 90% confident it will rain, but it's actually sunny. The AI says, "I'm very confident this key fits!" even when it's holding a key to a completely different house.
    • AlphaFold3 was the "best" of the bunch, but it still failed to distinguish real from fake most of the time.
    • Boltz-2 was "overconfident," giving high scores to almost everything, even the mismatches.
    • Chai-1 was "underconfident," sometimes missing good matches because it didn't trust its own predictions.
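
One concrete way to quantify this failure is to pool the confidence scores for the real and shuffled pairs and ask how well they separate, for example with a ROC AUC. The sketch below uses scikit-learn and made-up placeholder scores purely for illustration; it does not reproduce the paper's scoring pipeline or numbers.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder scores; in practice these come from the models' own confidence
# outputs for each real and each shuffled (decoy) pair.
real_scores = np.array([0.82, 0.77, 0.91, 0.68, 0.85])      # true antibody-antigen pairs
shuffled_scores = np.array([0.80, 0.74, 0.88, 0.70, 0.83])  # mismatched pairs

labels = np.concatenate([np.ones_like(real_scores), np.zeros_like(shuffled_scores)])
scores = np.concatenate([real_scores, shuffled_scores])

auc = roc_auc_score(labels, scores)
print(f"Real-vs-shuffled AUC: {auc:.2f}")
# AUC near 1.0: the confidence score cleanly separates real from fake pairs.
# AUC near 0.5: the score is no better than a coin flip, which is the kind of
#               failure the paper describes for these tools.
```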

The "More Sampling" Myth

A common idea in AI is: "If we run the simulation 100 times instead of once, we'll get a better answer."

  • What the paper found: Running the AI 100 times does make the shape of the key and lock look slightly better (more polished). However, it does not help the AI realize if it's holding the wrong key in the first place.
  • The Analogy: Imagine you are trying to solve a maze. If you run the maze 100 times, you might draw the walls a bit straighter and the path a bit smoother. But if you started in the wrong room, drawing the walls better won't get you to the exit. The AI gets stuck in the "wrong room" (the wrong binding mode) and just refines that mistake.

The Cost of Computing

The researchers also measured how much electricity these AI tools use.

  • The Finding: Running the AI 100 times uses a lot of energy (like leaving a high-powered computer running for hours).
  • The Advice: The paper suggests that running the AI 10 to 25 times is usually enough to get a "good enough" shape. Running it 100 times is mostly a waste of energy because the AI isn't learning anything new about which key fits; it's just polishing the same wrong answer (a rough cost estimate is sketched after this list).
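
For intuition only, here is a back-of-the-envelope sketch of how the energy bill scales with the number of samples. The wattage and runtime below are assumed placeholder values, not measurements from the paper; plug in your own hardware numbers.

```python
# Illustrative energy estimate for repeated sampling of one antibody-antigen complex.
gpu_power_watts = 300          # assumed average draw of one GPU (placeholder)
minutes_per_prediction = 10    # assumed time for one prediction (placeholder)

def energy_kwh(n_samples):
    hours = n_samples * minutes_per_prediction / 60
    return gpu_power_watts * hours / 1000

for n in (1, 10, 25, 100):
    print(f"{n:>3} samples -> ~{energy_kwh(n):.2f} kWh per complex")
# Energy grows linearly with sample count, but (per the paper) the ability to
# tell a real pair from a shuffled one does not improve, so samples beyond
# roughly 10-25 mostly buy polish, not discrimination.
```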

The Bottom Line for Drug Discovery

Scientists use these AI tools to design new medicines (antibodies) to fight diseases. They hope to generate thousands of potential drug candidates and use the AI's confidence score to pick the best ones.

  • The Warning: This paper warns that you cannot trust the AI's confidence score alone. If you pick the top 100 "most confident" predictions, you will likely get a mix of real winners and a huge number of "hallucinations" (fake matches that look real but don't work); the sketch after this list shows how that selection can go wrong.
  • The Solution: Instead of just trusting the AI's internal score, scientists need to use "negative controls." This means testing the AI against fake, shuffled pairs to see if it can tell the difference. If the AI can't tell the difference between a real match and a fake one, its high confidence score is meaningless.
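
The sketch below illustrates the "top 100 most confident" trap: rank a mixed pool of real and shuffled pairs by the model's own confidence, keep the best-scoring ones, and count how many are actually real. The pair IDs and scores are invented for illustration and do not come from the paper.

```python
# Hypothetical pool of predictions: (pair_id, is_real_pair, model_confidence).
predictions = [
    ("ab001-ag001", True, 0.86),
    ("ab001-ag047", False, 0.84),   # a shuffled pair scored almost as highly
    ("ab002-ag002", True, 0.79),
    ("ab003-ag012", False, 0.81),
    # ... thousands more entries in a real screen
]

# Keep the 100 highest-confidence predictions, exactly as a naive screen would.
top_k = sorted(predictions, key=lambda p: p[2], reverse=True)[:100]
n_real = sum(1 for _, is_real, _ in top_k if is_real)
print(f"Real binders in the top {len(top_k)}: {n_real} "
      f"({n_real / len(top_k):.0%} precision)")
# If confidence cannot separate real from shuffled pairs, this precision
# collapses toward the base rate of real pairs in the pool, i.e. the top of
# the list fills up with plausible-looking "hallucinations".
```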

In short: The AI is a master architect that can build beautiful, plausible-looking castles. But it is currently terrible at knowing which castle is actually built on solid ground and which one is just a mirage. We need better ways to check the foundation before we start building our medicines.
