Original authors: Ada Fang, Robert G. Alberstein, Simon Kelow, Frédéric A. Dreyer

Published 2026-06-03

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine the human immune system as a massive library of keys (antibodies) designed to unlock specific locks (viruses and bacteria). The most important part of each key is the set of "teeth" at the very tip, which wiggle around to grab onto the lock. In the scientific world, these wiggly teeth are called loops (specifically, Complementarity-Determining Regions or CDRs).

For decades, scientists have tried to organize these loops into neat categories, like sorting books by genre. However, this old system had two big problems:

It was incomplete: About 20% of the loops didn't fit into any category, leaving them as "unclassifiable noise."
It was too simple: The old system only looked at the shape of the loop, ignoring the specific letters (amino acids) that made it up.

Enter IGLOO (ImmunoGlobulin LOOp Tokenizer). Think of IGLOO as a new, super-smart librarian who doesn't just sort books by genre, but understands the story (sequence) and the binding (shape) simultaneously.

Here is how the paper explains IGLOO and its achievements, broken down into simple concepts:

1. The "Token" Analogy: Turning Shapes into Words

In computer science, "tokenization" is like turning a sentence into a list of words that a computer can understand.

The Old Way: Previous methods tried to describe a loop by looking at every single atom, like trying to describe a painting by listing the color of every single pixel. It was slow and missed the big picture.
The IGLOO Way: IGLOO looks at a whole loop and says, "This specific shape and sequence is like the word 'Apple'." It turns a complex 3D structure into a single, compact digital "token."
The Magic: It learns this by looking at the "backbone" of the loop (the angles where the chain bends). If two loops bend the same way, IGLOO gives them similar tokens, even if their amino acid letters are different.

2. The Training: Learning by Comparison

IGLOO was trained using a game of "Find the Twin."

The computer was shown pairs of loops.
If two loops had very similar bending angles, they were marked as "twins" (positive pairs).
If they were very different, they were marked as "strangers" (negative pairs).
IGLOO learned to push the "twins" close together in its digital brain and push the "strangers" far apart. This allowed it to create a map where similar loops live in the same neighborhood.

3. What IGLOO Actually Achieved

The paper tests this new librarian in three specific ways:

A. The "Find My Twin" Test (Retrieval)

The Task: Give IGLOO a loop and ask it to find the most similar loops from a database of millions.
The Result: IGLOO was the best at this. It found matching loops better than any previous method.
The Highlight: It was especially good at finding matches for the H3 loop, which is the most chaotic and diverse loop in the antibody family. It beat the previous best method by nearly 6%.

B. The "Sorting Hat" Test (Clustering)

The Task: Can IGLOO sort loops into the old, established categories (canonical clusters) that scientists have used for years?
The Result: Yes. It successfully sorted 90% of the loops into the correct existing categories.
The Bonus: Unlike the old system, IGLOO can also sort the 20% of loops that didn't have a category before, giving them a place to live without forcing them into a box they don't fit.

C. The "Predictor" and "Creator" Tests

The authors plugged IGLOO's new "tokens" into two different AI models to see if they made them smarter:

IGLOOLM (The Predictor): This model predicts how well an antibody will stick to a virus. When given IGLOO's tokens, it became better at predicting this "stickiness" (binding affinity) than the base model, often outperforming much larger models.
IGLOOALM (The Creator): This model tries to design new loops. When asked to invent a loop that looks like a specific shape but has a different sequence of letters, IGLOOALM did a better job than current state-of-the-art tools. It created loops that were diverse in their letters but kept the correct 3D shape.

4. Why This Matters (According to the Paper)

The paper concludes that by treating antibody loops as "multimodal tokens" (combining shape and sequence), IGLOO captures the true diversity of how these loops work.

It fixes the "missing data" problem of old classification systems.
It makes protein language models (the AI brains) smarter and more efficient.
It helps in the rational design of new antibodies by allowing scientists to search for shapes and generate new ones more effectively.

In short: IGLOO is a new tool that translates the complex, wiggly shapes of antibody tips into a language computers understand better, allowing us to find, sort, and design them with much higher precision than before.

Technical Summary: IGLOO – ImmunoGlobulin LOOp Tokenizer

Problem Statement

The Complementarity-Determining Regions (CDRs) of antibodies are loop structures critical for antigen recognition and binding. Historically, the diversity of these loops has been categorized into "canonical clusters" based on backbone dihedral angles (Chothia & Lesk, 1987; North et al., 2011). However, existing canonical clustering approaches suffer from three primary limitations:

Limited Coverage: A significant portion of antibody loops, particularly the highly diverse H3 loops (76.3%), lack a known matching canonical cluster.
Lack of Sequence Integration: Existing clusters rely solely on backbone coordinates or dihedral angles, ignoring sequence information.
Incompatibility with Foundation Models: Current clustering methods cannot be readily integrated into protein language models (PLMs) or multimodal foundation models, which typically tokenize at the amino acid level rather than the substructure loop level.

While recent multimodal PLMs have incorporated structure tokens, they generally focus on amino acid-level reconstruction and do not account for the higher-level modularity of protein domains. There remains an open challenge in tokenizing immunoglobulin loops to capture both structural and sequence diversity for effective representation learning.

Methodology: IGLOO

The authors introduce IGLOO (ImmunoGlobulin LOOp Tokenizer), a multimodal tokenizer designed to encode antibody loops by combining sequence and backbone dihedral angles.

Architecture and Input

IGLOO operates at the substructure loop level. For a loop of length $n$, the input consists of:

Sequence: A sequence of amino acids $a = (a_1, \dots, a_n)$.
Structure: Backbone dihedral angles $\phi, \psi, \omega \in (-\pi, \pi]^n$.

Each dihedral angle $\theta$ is projected onto the unit circle as $(\cos \theta, \sin \theta)$ and passed through a linear layer. The sequence is encoded using learnable embeddings for the 20 canonical amino acids. These two modalities are summed to create a multimodal embedding $X$. A Transformer architecture (based on ESM-2) processes these residues, outputting a continuous classification token $t$ representing the entire loop and residue-level multimodal tokens $x_i$.
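The unit-circle projection described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `angles_to_unit_circle` and the feature ordering are assumptions, and the linear layer and amino acid embeddings are left out.

```python
import math
from typing import List, Sequence


def angles_to_unit_circle(
    phi: Sequence[float], psi: Sequence[float], omega: Sequence[float]
) -> List[List[float]]:
    """Map per-residue dihedral angles (radians) to cos/sin features.

    Each residue yields six numbers, ordered as
    (cos phi, sin phi, cos psi, sin psi, cos omega, sin omega).
    The ordering is an illustrative assumption.
    """
    if not (len(phi) == len(psi) == len(omega)):
        raise ValueError("phi, psi and omega need one value per residue")
    features = []
    for residue_angles in zip(phi, psi, omega):
        row: List[float] = []
        for theta in residue_angles:
            row.extend((math.cos(theta), math.sin(theta)))
        features.append(row)
    return features
```

The point of this encoding is that angles on either side of the $\pm\pi$ boundary, which are numerically far apart, land on nearby points of the circle.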

Training Objectives

IGLOO is trained using a self-supervised contrastive learning framework with four objectives:

Masked Reconstruction: Predicting masked dihedral angles and masked amino acid identities (counted as two of the four objectives, one per modality).
Contrastive Learning of Backbones: The core innovation. IGLOO learns to map loops with similar backbone dihedral angles closer together in the latent space. Similarity is defined by the dihedral angle distance $D$ (North et al., 2011), which accounts for the chiral nature of proteins and side-chain positioning better than RMSD.
- Positive Pairs: Loops of the same length with $D < 0.1$.
- Negative Pairs: Loops of different lengths or same length with $D > 0.47$.
- A margin is applied to prevent overfitting to the specific threshold used for canonical cluster definitions.
Codebook Learning: A vector quantization loss assigns loops to discrete tokens ($\hat{t}$) from a codebook, facilitating fast retrieval and comparison.
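The positive and negative pair rule from the contrastive objective can be illustrated as follows. This is a sketch under stated assumptions: the per-angle term $2(1 - \cos\Delta\theta)$ averaged over all supplied angles is a commonly cited form of the North et al. distance, and whether $\omega$ is included is not stated here, so the functions simply take whatever flattened angle list they are given. The names `dihedral_distance` and `pair_label` are hypothetical.

```python
import math
from typing import Optional, Sequence

POSITIVE_MAX = 0.1   # threshold from the text: D < 0.1
NEGATIVE_MIN = 0.47  # threshold from the text: D > 0.47


def dihedral_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean of 2 * (1 - cos(delta)) over paired angles given in radians.

    Assumed form of the distance; the cosine makes it periodic, so
    angles of +pi and -pi count as identical.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("angle lists must be non-empty and of equal length")
    return sum(2.0 * (1.0 - math.cos(x - y)) for x, y in zip(a, b)) / len(a)


def pair_label(a: Sequence[float], b: Sequence[float]) -> Optional[str]:
    """Return 'positive', 'negative', or None for pairs inside the margin."""
    if len(a) != len(b):
        return "negative"  # loops of different lengths are always negatives
    d = dihedral_distance(a, b)
    if d < POSITIVE_MAX:
        return "positive"
    if d > NEGATIVE_MIN:
        return "negative"
    return None  # between the two thresholds: not used as a training pair
```

Pairs that fall between the two thresholds get no label, which is one way to read the margin described above.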

Integration into Language Models

The authors demonstrate two methods for incorporating IGLOO tokens into protein language models, fine-tuned from the base antibody model IgBert:

IGLOOLM: Inserts the continuous classification token $t$ at the start of each CDR loop. This captures the overall loop conformation context.
IGLOOALM: Inserts both the classification token $t$ and the multimodal residue tokens $x_i$ for each amino acid in the loop. This provides residue-level structural context.
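To make the difference between the two input layouts concrete, here is a toy sketch that uses placeholder strings instead of real embeddings. Everything in it is hypothetical: the function name, the `<t:...>` and `<x:...>` markers, the half-open loop spans, and in particular the choice to place each residue token directly before its amino acid, since the summary does not say how the residue tokens are positioned.

```python
from typing import Dict, List, Tuple


def build_input_tokens(
    sequence: str,
    loops: Dict[str, Tuple[int, int]],
    with_residue_tokens: bool = False,
) -> List[str]:
    """Lay out amino acids plus loop-token placeholders.

    loops maps a loop name to a 0-indexed, half-open (start, end) span.
    with_residue_tokens=False mimics the IGLOOLM layout (one loop token
    per loop); True mimics the IGLOOALM layout (loop token plus one
    residue token per loop position). The placement is an assumption.
    """
    loop_at: Dict[int, Tuple[str, int]] = {}
    for name, (start, end) in loops.items():
        for pos in range(start, end):
            loop_at[pos] = (name, pos - start)

    tokens: List[str] = []
    for pos, amino_acid in enumerate(sequence):
        if pos in loop_at:
            name, offset = loop_at[pos]
            if offset == 0:
                tokens.append(f"<t:{name}>")  # loop-level token t
            if with_residue_tokens:
                tokens.append(f"<x:{name}:{offset}>")  # residue token x_i
        tokens.append(amino_acid)
    return tokens
```

For a made-up sequence `"QVQARDYW"` with a loop spanning positions 3 to 5, the first layout adds a single marker before `A`, while the second adds that marker plus one residue marker per loop position.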

Key Results

1. Paratope Retrieval

IGLOO was evaluated on retrieving similar loop structures from the SAbDab database.
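Retrieval with loop embeddings amounts to ranking a database by similarity to a query. The sketch below assumes cosine similarity as the scoring function, which this summary does not confirm, and the names `cosine_similarity` and `retrieve` are illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / norm


def retrieve(
    query: Sequence[float], database: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Return the ids of the k database entries most similar to the query."""
    ranked = sorted(
        database,
        key=lambda loop_id: cosine_similarity(query, database[loop_id]),
        reverse=True,
    )
    return ranked[:k]
```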

Performance: IGLOO achieved state-of-the-art performance in retrieving loops with similar backbone structures (defined by $D < 0.47$ and RMSD $< 1$ Å).
H3 Loop Specifics: For the highly diverse H3 loop, IGLOO outperformed the best existing structure tokenizer (AminoAseed) by 5.9% and the best protein language model (ESM-2) by 69.8% in terms of precision at rank 20 for dihedral distance retrieval.
Efficiency: IGLOO outperformed larger models (e.g., ESM-2 3B) despite being significantly smaller, demonstrating the efficiency of loop-level tokenization.
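Precision at rank $k$, the metric quoted for the H3 comparison, is the fraction of the top $k$ retrieved items that are relevant. A minimal sketch follows, in which "relevant" would mean a database loop within the stated threshold of the query (for example $D < 0.47$). The function name and the choice to divide by $k$ even when fewer than $k$ items are returned are assumptions.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the first k ranked ids that appear in relevant_ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for loop_id in ranked_ids[:k] if loop_id in relevant_ids)
    return hits / k
```

With `k = 20`, a score of 0.5 would mean that 10 of the 20 closest loops in embedding space are true structural matches.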

2. Recovery of Canonical Clusters

IGLOO was tested on its ability to recover known canonical clusters defined by Kelow et al. (2022).
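The quantized tokens used in this evaluation come from the codebook step. In standard vector quantization a continuous embedding is assigned to its nearest codebook vector, and the sketch below shows that lookup under the assumption of squared Euclidean distance. The name `quantize` is hypothetical, and the training loss that shapes the codebook is not shown.

```python
from typing import Sequence


def quantize(embedding: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to the embedding."""
    if len(codebook) == 0:
        raise ValueError("codebook must contain at least one vector")

    def squared_distance(code: Sequence[float]) -> float:
        return sum((e - c) ** 2 for e, c in zip(embedding, code))

    # The returned index plays the role of the discrete loop token.
    return min(range(len(codebook)), key=lambda idx: squared_distance(codebook[idx]))
```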

Purity: The quantized tokens achieved high cluster purity (e.g., 0.983 for Heavy CDR4, 0.900 for Heavy CDR2).
Coverage: Unlike traditional methods, IGLOO assigns tokens to all loops, addressing the coverage gap where 20.3% of loops previously had no canonical match. It successfully reproduces known canonical conformations for 90.6% of loops in SAbDab.
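Cluster purity can be computed by giving each token the canonical label that occurs most often among its loops and counting how many loops agree with that majority. The sketch below uses this standard definition; whether the paper computes purity in exactly this way is an assumption, and the label strings in the usage note are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def cluster_purity(tokens: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Share of loops whose label matches the majority label of their token."""
    if len(tokens) != len(labels) or len(tokens) == 0:
        raise ValueError("tokens and labels must be non-empty and aligned")
    counts_by_token: Dict[Hashable, Counter] = defaultdict(Counter)
    for token, label in zip(tokens, labels):
        counts_by_token[token][label] += 1
    majority_total = sum(max(c.values()) for c in counts_by_token.values())
    return majority_total / len(tokens)
```

For example, five loops with tokens `[0, 0, 0, 1, 1]` and labels `["A", "A", "B", "C", "C"]` give a purity of 0.8, because four of the five loops carry the majority label of their token.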

3. Binding Affinity Prediction

IGLOOLM was evaluated on predicting the binding affinity of heavy chain variants across 10 antibody-antigen targets (AbBiBench).

Performance: IGLOOLM outperformed the base IgBert model on 8 out of 10 targets.
Comparison: It performed on par with or better than models with 7× more parameters (e.g., ESM-2 3B).
Observation: For this task, models using only the classification token (IGLOOLM) outperformed those that also use residue-level structural tokens (IGLOOALM). This suggests that in deep mutational scanning, where sequences differ by only a few mutations, the loop-level conformation summary is more robust than potentially noisy residue-level structural predictions.

4. Controllable Loop Sampling

IGLOOALM was used to sample antibody loops with masked sequences.

Diversity vs. Consistency: The sampled loops exhibited high sequence diversity while maintaining structural consistency with the original loop.
Case Study: In redesigning the H3 loop of a SARS-CoV-2 antibody (PDB 7TCQ), IGLOOALM generated sequences with an average edit distance of 6.6 from the original, yet the predicted structures maintained the original beta-hairpin fold with an average RMSD of 0.79 Å. This outperformed state-of-the-art antibody inverse folding models (AbMPNN, AntiFold).
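The sequence-diversity figure in the case study is an average edit distance between each sampled loop and the original. The sketch below assumes the Levenshtein definition (insertions, deletions, and substitutions each cost 1), which the summary does not specify. The function names and the sequences in the usage note are invented for illustration.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def mean_edit_distance(original: str, samples: Sequence[str]) -> float:
    """Average edit distance from the original loop to each sampled loop."""
    if len(samples) == 0:
        raise ValueError("need at least one sampled sequence")
    return sum(edit_distance(original, s) for s in samples) / len(samples)
```

Calling `mean_edit_distance("ARDY", ["ARDY", "AKDY", "ARD"])` averages distances of 0, 1, and 1.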

Significance and Claims

The paper claims that IGLOO demonstrates the benefit of introducing multimodal tokens specifically for antibody loops. By encoding the diverse landscape of loop conformations through a combination of sequence and dihedral angles, IGLOO:

Improves Foundation Models: Enhances the expressiveness of protein language models, allowing them to capture structural motifs that sequence-only models miss.
Solves Coverage Issues: Provides a tokenization scheme that covers the entire landscape of antibody loops, including those previously unclassifiable by canonical methods.
Advances Rational Design: Facilitates applications in binding affinity prediction and the controllable generation of diverse, structurally consistent antibody loops, which is critical for antibody engineering and lead optimization.

The authors note that while IGLOOALM shows strong in silico results, comprehensive wet-lab validation is required to confirm that redesigned antibodies maintain actual antigen binding. They envision future extensions to include all-atom structures, epitope information, and binding affinity directly as modalities.

Tokenizing Loops of Antibodies