Contrastive learning for antibody-antigen sequence-to-specificity prediction

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine the human immune system as a massive, high-tech library. Inside this library are billions of unique "keys" (antibodies) designed to find and lock onto specific "locks" (antigens, like viruses or bacteria).

The big challenge scientists face is this: If you give them a picture of a lock, can they instantly tell you which key fits it? And if you give them a key, can they tell you exactly which lock it opens?

Currently, computers are terrible at this. They can guess the shape of a lock, or they can design a key if they already know the lock's shape, but they can't easily look at the raw "blueprint" (the amino acid sequence) of a key and a lock and say, "Yes, these two belong together."

This paper introduces a new AI tool called CALM (Cross-attention Adaptive Immune Receptor–Antigen Language Model) to solve this problem. Here is how it works, explained simply:

1. The Problem: The "Lost in Translation" Issue

Think of antibodies and antigens as two people speaking different languages. One speaks "Antibody," the other speaks "Antigen." For decades, scientists have tried to build a dictionary to translate between them, but the languages are too complex and the dictionary is too big.

Existing methods are like trying to build a 3D model of the lock and key from scratch every time. It's slow, expensive, and often gets the details wrong.

2. The Solution: CALM as a "Universal Translator"

CALM is like a super-smart translator that doesn't care about the 3D shape of the lock or key. Instead, it looks at the text (the sequence of letters) that makes them up.

It uses a technique called Contrastive Learning. Here is a simple analogy:

Imagine you are teaching a dog to recognize its owner.
You show the dog a photo of its owner and a photo of a stranger.
You say, "This is the owner (Good!)" and "This is not (Bad!)."
Over time, the dog learns to keep the "Owner" photo and the "Stranger" photo far apart in its mind.

CALM does this with thousands of antibody-antigen pairs. It learns to pull the "matching" pairs (the key and its lock) close together in a digital space, and push the "non-matching" pairs far apart.

3. How CALM Works: The "Two-Door" System

CALM has two main parts (encoders):

The Antibody Door: Reads the antibody's sequence.
The Antigen Door: Reads the antigen's sequence.

When you feed them a pair, they translate both into a secret code (an "embedding"). If the pair is a match, their secret codes end up right next to each other in a giant digital room. If they don't match, they end up on opposite sides of the room.

The Cool Trick: Because it learns this "room" so well, you can walk in from either side!

Forward: Give it an antibody, and it finds the matching antigen.
Reverse: Give it an antigen, and it finds the matching antibody.

4. The "Zoom-In" Feature

The researchers also tried a clever trick. Antibodies are long strings of letters, but only a small subset of those letters (the "paratope") actually touches the antigen, and only at a small patch of the antigen (the "epitope"). The rest is mostly structural support.

They taught CALM to zoom in and only look at those specific touching letters, ignoring the rest.

Analogy: Imagine trying to recognize a couple by looking at their whole bodies in a crowded stadium. It's hard. But if you zoom in and only look at their hands holding each other, it becomes much easier to tell they are a pair.
Result: When CALM focused only on the "holding hands" parts, it got even better at finding matches.

5. The Results: A Small but Mighty Step

The team tested CALM on a dataset of about 4,000 known pairs. They made the test very hard by hiding the test antigens from the training data (so the AI couldn't just memorize the answers; it had to actually understand the rules).

The Score: In the hardest test, where test antigens shared at most 40% sequence identity with anything seen in training, CALM ranked the correct match as the #1 choice about 2% of the time. On an easier split (80% identity) with the "zoom-in" trick, that rose to about 7%.
Why that matters: Random guessing gets it right only about 0.6% of the time, so even in the hardest test CALM is roughly three times better than chance.
The Potential: While 7% sounds low, it is an encouraging early result. It suggests that an AI can learn the "grammar" of how antibodies and antigens talk to each other just by reading their sequences.

Why This Matters for the Future

Right now, finding a new drug takes years of lab work.

Today: Scientists grow cells, test thousands of samples, and hope to find a match.
With CALM (in the future): Scientists could type in a virus sequence, and CALM could instantly suggest the top 100 antibodies that might stop it. Or, they could type in an antibody they have, and CALM could tell them exactly what disease it fights.

The Bottom Line

This paper is the "Hello World" of a new era. CALM isn't perfect yet—it's like a child who has just learned to speak a new language. It makes mistakes, and it needs more practice (more data). But it has proven that the language of the immune system can be learned by a computer, opening the door to designing life-saving drugs faster and reading the immune system's secrets like never before.

1. Problem Statement

The central challenge addressed is the "sequence-to-specificity" problem: predicting which antibodies bind to which antigens directly from primary amino acid sequences, without relying on 3D structural data.

Current Limitations: Existing methods struggle to map antibodies to cognate epitopes (and vice versa) at a repertoire and proteome scale.
- Structure-based design (e.g., AlphaFold 3, RFdiffusion) can generate binders for known epitopes but does not solve the reverse retrieval task (finding the epitope for a given antibody).
- Protein Language Models (PLMs) (e.g., ESM-2, AntiBERTy) learn structural and biophysical features but lack a unified framework for bidirectional binding specificity.
The Goal: To develop an Immune Specificity Foundation Model (ISFM) capable of bidirectional retrieval (Antibody $\to$ Antigen and Antigen $\to$ Antibody) and eventually generative design, operating purely on sequence data.

2. Methodology: CALM Architecture

The authors introduce CALM (Cross-attention Adaptive Immune Receptor–Antigen Language Model). The current work focuses on Stage 1: a dual-encoder contrastive learning framework.

A. Architecture

Dual-Encoder Design:
- Antibody Encoder: Initialized with AntiBERTy (specialized for immune repertoires). Processes Heavy (VH) and Light (VL) chains separately, then concatenates embeddings.
- Antigen Encoder: Initialized with ESM-2 (general protein language model).
Cross-Attention Decoder (Proposed, not trained): The authors propose an autoregressive decoder with cross-attention to enable generative translation (e.g., generating an epitope sequence given an antibody). This is conceptualized to unify discriminative alignment and generative design but is not implemented or evaluated in this study.
Embedding Space: The encoders project sequences into a shared latent space where cognate binding pairs are pulled closer together, and non-binders are pushed apart.
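The shared embedding space and the two retrieval directions can be illustrated with a minimal sketch. It assumes the encoder outputs are already available as plain vectors of matching size; the toy vectors and the helper names (`l2_normalize`, `embed_antibody`, `retrieve`) are hypothetical, and the learned projection layers that would map AntiBERTy and ESM-2 outputs into a common dimension are omitted.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def embed_antibody(vh_vec, vl_vec):
    """Concatenate heavy- and light-chain vectors, then normalize (assumed layout)."""
    return l2_normalize(list(vh_vec) + list(vl_vec))

def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, candidates, k=1):
    """Return the indices of the k candidates most similar to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(query, candidates[i]),
                   reverse=True)
    return order[:k]

# Toy stand-ins for encoder outputs (assumption: real vectors come from the
# trained antibody and antigen encoders, not from hand-written numbers).
antibodies = [
    embed_antibody([1.0, 0.0], [0.9, 0.1]),
    embed_antibody([0.0, 1.0], [0.1, 0.9]),
]
antigens = [
    l2_normalize([1.0, 0.1, 0.8, 0.0]),
    l2_normalize([0.1, 1.0, 0.0, 0.9]),
]

ab_to_ag = retrieve(antibodies[0], antigens)  # antibody -> antigen
ag_to_ab = retrieve(antigens[1], antibodies)  # antigen -> antibody
print(ab_to_ag, ag_to_ab)  # prints [0] [1]
```

Because both directions rank candidates by the same similarity score, one trained space serves Ab $\to$ Ag and Ag $\to$ Ab retrieval alike.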

B. Training Strategy

Contrastive Learning: Uses a symmetric multi-positive InfoNCE loss (similar to CLIP).
- Objective: Maximize cosine similarity between true binder pairs (Paratope-Epitope) while minimizing similarity to negative pairs within the batch.
- Multi-Positive Handling: Accounts for cases where multiple different antibodies bind the same epitope (or vice versa) within a batch.
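One common way to write a symmetric multi-positive InfoNCE loss is sketched below. The source does not give the exact formula, so the averaging over positives, the temperature value of 0.07, and all function names are assumptions made for illustration. `sim[i][j]` is the cosine similarity between antibody `i` and antigen `j` in a batch, and `positives[i][j]` is 1 when that pair is a true binder.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def directional_loss(sim, positives, temperature):
    """Mean over queries of the average negative log-probability of their positives."""
    total = 0.0
    for sim_row, pos_row in zip(sim, positives):
        log_probs = log_softmax([s / temperature for s in sim_row])
        pos_idx = [j for j, flag in enumerate(pos_row) if flag]
        total += -sum(log_probs[j] for j in pos_idx) / len(pos_idx)
    return total / len(sim)

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def symmetric_multi_positive_info_nce(sim, positives, temperature=0.07):
    """Average the antibody->antigen and antigen->antibody losses."""
    ab_to_ag = directional_loss(sim, positives, temperature)
    ag_to_ab = directional_loss(transpose(sim), transpose(positives), temperature)
    return 0.5 * (ab_to_ag + ag_to_ab)

# Toy batch: antibodies 0 and 1 bind the same epitope, so each has two positives.
positives = [[1, 1, 0],
             [1, 1, 0],
             [0, 0, 1]]
aligned = [[0.9, 0.9, 0.1],
           [0.9, 0.9, 0.1],
           [0.1, 0.1, 0.9]]
misaligned = [[0.1, 0.1, 0.9],
              [0.1, 0.1, 0.9],
              [0.9, 0.9, 0.1]]

loss_aligned = symmetric_multi_positive_info_nce(aligned, positives)
loss_misaligned = symmetric_multi_positive_info_nce(misaligned, positives)
print(loss_aligned < loss_misaligned)  # prints True
```

Without the positives mask, the two antibodies that share an epitope would be treated as each other's negatives, which is the case the multi-positive handling is meant to avoid.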
Masking: To focus on binding interfaces, the model applies binary masks derived from structural data (residues within 5 Å of the binding partner). This creates "Paratope" and "Epitope" specific inputs, filtering out non-interacting sequence noise.
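The 5 Å masking rule can be sketched as follows. This is a simplified illustration: it assumes a residue counts as interface if any of its atoms lies within the cutoff of any partner atom, and that masked-out residues are simply dropped from the input. The source states only the 5 Å threshold, so both choices, the toy coordinates, and the function names are assumptions.

```python
import math

CUTOFF_ANGSTROM = 5.0  # distance threshold quoted in the text

def interface_mask(residues, partner_residues, cutoff=CUTOFF_ANGSTROM):
    """Return 1 for residues with any atom within `cutoff` of a partner atom, else 0.

    Each residue is a list of (x, y, z) atom coordinates in angstroms.
    """
    partner_atoms = [atom for res in partner_residues for atom in res]
    mask = []
    for res in residues:
        close = any(math.dist(a, b) <= cutoff for a in res for b in partner_atoms)
        mask.append(1 if close else 0)
    return mask

def apply_mask(sequence, mask):
    """Keep only the residues flagged as interface (assumed masking behavior)."""
    return "".join(aa for aa, keep in zip(sequence, mask) if keep)

# Toy complex with one atom per residue (real structures have many atoms each).
antibody_seq = "ARDY"
antibody_residues = [[(0.0, 0.0, 0.0)], [(4.0, 0.0, 0.0)],
                     [(8.0, 0.0, 0.0)], [(12.0, 0.0, 0.0)]]
antigen_seq = "KG"
antigen_residues = [[(10.0, 3.0, 0.0)], [(30.0, 0.0, 0.0)]]

paratope_mask = interface_mask(antibody_residues, antigen_residues)
epitope_mask = interface_mask(antigen_residues, antibody_residues)
print(paratope_mask, apply_mask(antibody_seq, paratope_mask))  # prints [0, 0, 1, 1] DY
print(epitope_mask, apply_mask(antigen_seq, epitope_mask))     # prints [1, 0] K
```

Note that the mask is computed from a solved structure of the complex, so structural data is still needed to prepare masked inputs even though the model itself reads only sequence.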
Optimization: Trained using AdamW with a cosine annealing scheduler and warm restarts.
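For readers unfamiliar with the schedule, cosine annealing with warm restarts can be written in a few lines. The learning rates and cycle lengths below are hypothetical, since the source does not report them. In PyTorch the same schedule is available as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, though the source does not say which implementation was used.

```python
import math

def cosine_warm_restart_lr(step, base_lr, min_lr, first_cycle, cycle_mult=1):
    """Learning rate at `step` under cosine annealing with warm restarts.

    The rate decays from `base_lr` to `min_lr` along a half cosine, then jumps
    back to `base_lr` at the start of each new cycle. Each cycle is `cycle_mult`
    times longer than the previous one.
    """
    cycle_len = first_cycle
    t = step
    while t >= cycle_len:  # find the position inside the current cycle
        t -= cycle_len
        cycle_len *= cycle_mult
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / cycle_len))

# Hypothetical settings: 10-step first cycle, each later cycle twice as long.
schedule = [cosine_warm_restart_lr(s, base_lr=1e-3, min_lr=1e-5,
                                   first_cycle=10, cycle_mult=2)
            for s in range(30)]
```

In this toy schedule the rate restarts at steps 10 and 30; the periodic jump back to the base rate is what distinguishes warm restarts from a single cosine decay.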

C. Data Curation & Leakage Control

Dataset: 4,138 curated antibody-antigen pairs from the SAbDab database (PDB-derived).
Strict Leakage Control: To prevent data leakage (where test sequences are too similar to training sequences), the dataset was split using MMseqs2 clustering based on antigen sequence identity.
- Out-of-Distribution (OOD) Splits: Antigens were clustered at 40%, 60%, or 80% sequence identity and split by cluster, so test antigens share at most that level of identity with any training antigen.
- In-Distribution (ID) Splits: Clustering based on antibody identity (90%/95%) to test generalization across antibody variants while keeping antigens familiar.
Baselines: Random shuffling (RS) and Unclustered (UC) datasets were used for comparison.
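The idea behind a cluster-level split is sketched below. It assumes cluster assignments (for example, parsed from a clustering tool's output) are already available as a dictionary; the pair records, cluster labels, `test_fraction`, and function name are hypothetical and do not reproduce the authors' pipeline.

```python
import random
from collections import defaultdict

def cluster_split(pairs, cluster_of, test_fraction=0.2, seed=0):
    """Split pairs so every antigen cluster lands entirely in train or in test.

    `test_fraction` is applied to the number of clusters, not the number of pairs.
    """
    by_cluster = defaultdict(list)
    for pair in pairs:
        by_cluster[cluster_of[pair["antigen_id"]]].append(pair)
    cluster_ids = sorted(by_cluster)
    random.Random(seed).shuffle(cluster_ids)
    n_test = max(1, round(test_fraction * len(cluster_ids)))
    test_ids, train_ids = cluster_ids[:n_test], cluster_ids[n_test:]
    train = [p for c in train_ids for p in by_cluster[c]]
    test = [p for c in test_ids for p in by_cluster[c]]
    return train, test

# Toy data: five antigens grouped into three clusters (hypothetical labels).
cluster_of = {"ag1": "c1", "ag2": "c1", "ag3": "c2", "ag4": "c3", "ag5": "c3"}
pairs = [
    {"antibody_id": "ab1", "antigen_id": "ag1"},
    {"antibody_id": "ab2", "antigen_id": "ag2"},
    {"antibody_id": "ab3", "antigen_id": "ag3"},
    {"antibody_id": "ab4", "antigen_id": "ag4"},
    {"antibody_id": "ab5", "antigen_id": "ag5"},
    {"antibody_id": "ab6", "antigen_id": "ag1"},
]

train, test = cluster_split(pairs, cluster_of, test_fraction=0.34)
train_clusters = {cluster_of[p["antigen_id"]] for p in train}
test_clusters = {cluster_of[p["antigen_id"]] for p in test}
print(train_clusters.isdisjoint(test_clusters))  # prints True
```

A plain random shuffle of pairs, by contrast, can place near-identical antigens on both sides of the split, which is why the random-shuffle and unclustered baselines are expected to look easier.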

3. Key Contributions

First Sequence-Native Bidirectional Model: CALM is the first framework to treat antibody-antigen recognition as a "molecular translation" task, enabling retrieval in both directions (Ab $\to$ Ag and Ag $\to$ Ab) using only sequence data.
Contrastive Co-Embedding: Successfully aligns antibody and antigen representations in a shared space using contrastive learning, demonstrating that binding specificity can be learned without explicit 3D structure inference during deployment.
Paratope-Epitope Focus: Demonstrated that restricting inputs to binding interface residues (via masking) significantly improves retrieval accuracy by reducing sequence-level noise.
Theoretical Insight: The authors propose a theoretical link between the mathematics of transformer attention (whose softmax weights have the form of a Boltzmann distribution) and immune clonal selection, suggesting that contrastive learning on binding data may be more data-efficient than standard deep learning scaling laws would predict.

4. Key Results

Performance was evaluated using Recall@k (R@k), specifically R@1, R@5, and R@10.
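As a reference for how such numbers are typically computed, here is a minimal Recall@k sketch. It assumes a query counts as a hit when at least one true partner appears among its top-k ranked candidates; the source does not spell out its exact definition, and the similarity scores below are invented for illustration.

```python
def recall_at_k(sim, positives, k):
    """Fraction of queries with at least one true partner in the top-k candidates.

    sim[i][j] is the similarity between query i and candidate j;
    positives[i][j] is 1 when candidate j is a true partner of query i.
    """
    hits = 0
    for sim_row, pos_row in zip(sim, positives):
        ranked = sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)
        if any(pos_row[j] for j in ranked[:k]):
            hits += 1
    return hits / len(sim)

# Toy scores for four queries against four candidates (true partner on the diagonal).
sim = [[0.9, 0.1, 0.2, 0.3],
       [0.5, 0.4, 0.6, 0.1],
       [0.2, 0.3, 0.8, 0.1],
       [0.7, 0.2, 0.1, 0.6]]
positives = [[1, 0, 0, 0],
             [0, 1, 0, 0],
             [0, 0, 1, 0],
             [0, 0, 0, 1]]

print(recall_at_k(sim, positives, 1))  # prints 0.5
print(recall_at_k(sim, positives, 2))  # prints 0.75
print(recall_at_k(sim, positives, 3))  # prints 1.0
```

With a single true partner among N candidates, a uniformly random ranking gives an expected Recall@k of k/N, which is why reported random baselines shrink as the candidate pool grows.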

Out-of-Distribution (Antigen Clustering):
- At the strictest split (40% identity), CALM achieved R@1 $\approx$ 2% and R@10 $\approx$ 9%, significantly outperforming random baselines (R@1 $\approx$ 0.6%).
- At 80% identity, performance improved to R@1 $\approx$ 6-7% and R@10 $\approx$ 16-19%.
- Directional Symmetry: Performance was consistent in both Ab $\to$ Ag and Ag $\to$ Ab directions, indicating a balanced embedding space.
Paratope-Epitope Masking:
- Using masked inputs (interface residues only) yielded systematically higher R@k scores than full-sequence inputs. For example, at 80% clustering, masked R@1 reached ~7% vs. ~6% for full sequences.
In-Distribution (Antibody Clustering):
- When antigens were familiar (no antigen clustering) but antibodies were clustered at 90% identity, CALM achieved R@1 $\approx$ 18-19% and R@10 $\approx$ 33-35%.
- This represents a ~46x improvement over random at R@1, demonstrating strong generalization to unseen antibody sequences within known antigen contexts.

5. Significance and Future Directions

Paradigm Shift: CALM moves beyond structure-conditioned design (which requires known epitopes) to a sequence-based retrieval system. This enables "reading" immune repertoires for diagnostics and "writing" novel therapeutics on demand.
Foundation for ISFM: While currently a retrieval model, the architecture is designed to evolve into a full Foundation Model that unifies discriminative alignment and generative design.
Data Efficiency: The results suggest that immune recognition follows specific scaling laws that may allow for high performance with relatively small datasets (~3,000 pairs) compared to the massive datasets required for general vision-language models (e.g., CLIP).
Limitations: The study is purely computational; no wet-lab validation was performed. The generative decoder is proposed but not yet trained. Future work will focus on scaling the dataset, training the decoder, and experimental validation.

Conclusion: CALM establishes a sequence-only foundation for bidirectional antibody-antigen specificity prediction, providing early evidence that contrastive learning can capture part of the complex "grammar" of immune recognition without requiring explicit structural modeling at inference time.