Multi-Modal Protein Representation Learning with CLASP

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to understand a complex machine, like a Swiss Army knife. To truly know what it does, you need three different kinds of information:

The Blueprint (Structure): How the metal is folded and where the blades are located in 3D space.
The Parts List (Sequence): The specific order of the screws, springs, and steel strips that make it up.
The User Manual (Text): A written description saying, "This tool cuts paper, opens bottles, and is great for camping."

For a long time, scientists studying proteins (the tiny machines that run our bodies) have looked at these three things separately. Some looked only at the parts list (the amino acid sequence), others only at the blueprint (the 3D shape), and others only read the user manuals (written descriptions in databases and scientific articles).

The Problem:
Looking at just one view is like trying to guess what a car is by only looking at its engine, or only reading the owner's manual without seeing the car. You miss the big picture. A protein's shape often tells you more about what it does than its parts list alone, but the "user manual" often explains why it does it in a way the blueprint can't.

The Solution: CLASP
The authors of this paper created a new AI tool called CLASP (Contrastive Language–Amino acid Sequence–Structure Pretraining). Think of CLASP as a super-intelligent translator that learns to speak three languages at once: "Shape," "Sequence," and "Text."

Here is how it works, using a simple analogy:

The "Three-Legged Stool" Analogy

Imagine you are trying to identify a specific person in a crowd.

Leg 1 (Structure): You see their 3D face and body shape.
Leg 2 (Sequence): You see their fingerprint or DNA.
Leg 3 (Text): You read a biography about them.

Old AI models were like people who could only look at one leg. If you showed them a fingerprint, they might guess the name, but they'd be shaky. If you showed them a biography, they might guess the face, but they'd be unsure.

CLASP is like a person who can look at all three legs simultaneously. It learns that "This specific fingerprint + This specific face shape + This specific biography" all belong to the same person. It builds a single, unified mental map where these three different views of the same protein are glued together.

How CLASP Learned (The Training Camp)

To teach CLASP, the researchers didn't just show it pictures. They used a game called "The Matching Game":

They took a protein and showed the AI its 3D shape (from a database called PDB), its amino acid sequence (the code), and its written description (functional annotations from the UniProt database).
They mixed them up. They showed the AI: "Here is the shape of Protein A. Here is the description of Protein B. Are they a match?"
The AI had to learn to say "No!" because the shape and description didn't belong together.
But when they showed the shape of Protein A and the description of Protein A, the AI learned to say "Yes!" and pull those two ideas closer together in its brain.

By playing this game over and over across tens of thousands of proteins, CLASP learned that if you know the shape, you can recognize the matching text, and if you know the text, you can recognize the matching sequence. It learned the deep, hidden connections between them.

Why This is a Big Deal (The Magic Tricks)

Once CLASP was trained, the researchers tested it on some "magic tricks" that other models couldn't do well:

The "Zero-Shot" Guess: They showed CLASP a protein structure it had never seen before and asked, "Does this shape go with this sequence?" or "Does this shape go with this description?" CLASP picked out the true matches far more often than not, and it was especially reliable at pairing shapes with sequences. It was like a detective who can tell whether a shoe print and a witness statement point to the same suspect, without any extra training on that particular case.
The "Library Search": Imagine you have a library of about 36,000 proteins. You give CLASP a messy, handwritten note describing a protein (e.g., "The thing that eats bacteria in our blood"). CLASP ranks the right protein at or near the top of the list, even if the note is written in a totally different style than the library's official catalog. It picks up on the meaning, not just the keywords.
The "Family Reunion": When the researchers looked at the data CLASP created, they saw that proteins from the same "family" (like cousins) naturally grouped together, even if they looked slightly different. This suggests CLASP captures the biological "family tree" better than the earlier models it was compared against.

The Secret Sauce

Why did CLASP work so well?

Geometry Matters: It used a special type of model (an E(3)-invariant graph neural network) that understands 3D space. It knows that if you rotate or move a protein, it's still the same protein. Models without this property can be thrown off by rotation; CLASP's structure encoder is built so that it is not.
The Trio Effect: The researchers showed that if you train without any one of the three inputs (Shape, Sequence, or Text), the model performs noticeably worse. The three work together like the legs of a stool: take one away and it becomes much less stable.

The Bottom Line

CLASP is a universal translator for biology. It bridges the gap between the hard, physical 3D world of proteins, the code that builds them, and the human language we use to describe them.

This means scientists could ask questions like: "Show me all proteins that look like this 3D shape but are described as 'cancer-fighting' in the literature," and use CLASP to retrieve likely candidates. It's a powerful new tool that could speed up drug discovery, help us understand diseases, and make sense of the massive amount of biological data we have today.

1. Problem Statement

Proteins are complex biological entities defined by three distinct but interconnected modalities:

Amino Acid Sequence: Linear strings of residues.
3D Structure: Spatial atomic coordinates (PDB files), which are often more conserved than sequences and are crucial for function.
Textual Descriptions: Curated natural language annotations (e.g., UniProt, literature) describing biochemical roles, disease associations, and functions.

Existing deep learning approaches typically focus on single modalities (e.g., Protein Language Models like ESM or ProtT5 for sequences) or bimodal alignments (e.g., Sequence-Text or Sequence-Structure). However, these methods often fail to capture the synergistic relationships between all three modalities simultaneously. Specifically, purely geometric models ignore rich functional text, while text-sequence models lack the structural constraints necessary for accurate functional inference. There is a critical need for a unified tri-modal framework that integrates structural, sequential, and textual signals to create biologically grounded, general-purpose protein embeddings.

2. Methodology: CLASP Framework

The authors introduce CLASP (Contrastive Language–Amino acid Sequence–Structure Pretraining), a unified tri-modal framework designed to learn a shared embedding space for proteins.

Architecture Components

CLASP consists of two primary modules:

Structure Encoder (Geometric Deep Learning):
- Input: Protein Data Bank (PDB) files.
- Graph Construction: PDB files are converted into atom-level graphs using the Graphein library. Nodes represent atoms (annotated with 7 Meiler biochemical descriptors and 3D coordinates), and edges represent Euclidean distances.
- Encoder: An E(3)-invariant Graph Neural Network (EGNN) processes these graphs. Unlike standard GNNs, EGNNs enforce rotational and translational invariance, ensuring embeddings depend only on intrinsic geometry, not arbitrary orientation.
- Output: A 512-dimensional structure embedding.
Alignment Module (Tri-Modal Contrastive Learning):
- Inputs:
  - Sequence: Precomputed 1024-dimensional embeddings from ProtT5 (a protein language model).
  - Text: Precomputed 1024-dimensional embeddings from BioGPT (a biology-tuned LLM) based on UniProt functional descriptions.
  - Structure: The output from the EGNN.
- Projection: Each modality passes through a trainable linear projection layer to map to a shared 512-dimensional space.
- Training Objective: A Tri-Modal Contrastive Loss inspired by CLIP and CG3D (CLIP Goes 3D). The model optimizes a symmetric cross-entropy loss across three pairs of modalities:
  1. Structure $\leftrightarrow$ Sequence
  2. Structure $\leftrightarrow$ Text
  3. Sequence $\leftrightarrow$ Text
- Mechanism: The loss function pulls embeddings of matching triplets (same protein) closer together while pushing non-matching triplets apart in the shared latent space.
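
The training objective described above can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation: it assumes the three embeddings have already been projected into the shared space, that the three pairwise losses are weighted equally, and that a fixed temperature of 0.07 is used. The function names (`symmetric_clip_loss`, `tri_modal_loss`) and the toy vectors are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_clip_loss(a, b, temperature=0.07):
    """CLIP-style symmetric cross-entropy between two batches of embeddings.

    Row i of `a` and row i of `b` are assumed to describe the same protein.
    The temperature value is an assumption, not taken from the paper.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # logits[i][j] = cosine similarity of a[i] and b[j], divided by temperature.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]
    logits_t = [list(col) for col in zip(*logits)]
    # Cross-entropy where the matching pair (the diagonal) is the target class.
    loss_ab = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_ba = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)

def tri_modal_loss(structure, sequence, text, temperature=0.07):
    """Sum of the three pairwise losses (equal weighting is an assumption)."""
    return (symmetric_clip_loss(structure, sequence, temperature)
            + symmetric_clip_loss(structure, text, temperature)
            + symmetric_clip_loss(sequence, text, temperature))

# Toy batch of three "proteins" in a 3-dimensional shared space.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # sequences paired with the wrong proteins

loss_aligned = tri_modal_loss(aligned, aligned, aligned)
loss_shuffled = tri_modal_loss(aligned, shuffled, aligned)
print(loss_aligned, loss_shuffled)
```

In this toy batch the loss is close to zero when all three views of each protein coincide and much larger when the sequence embeddings are paired with the wrong proteins, which is the pull-together, push-apart behaviour the mechanism bullet describes. The trainable projection layers are omitted here for brevity.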

3. Key Contributions

First Tri-Modal Protein Framework: The authors present CLASP as the first model to jointly align protein structure, sequence, and natural language descriptions in a single contrastive learning framework.
Geometry-Aware Encoding: By utilizing E(3)-invariant GNNs, the model captures structural properties that are robust to rotation and translation, addressing a key limitation of standard GNNs in molecular modeling.
Zero-Shot Capabilities: The framework enables zero-shot classification and retrieval tasks across modalities without task-specific fine-tuning.
Synergistic Integration: The study demonstrates that integrating all three modalities yields superior performance compared to bimodal or unimodal baselines, confirming that structural, sequential, and textual signals are complementary.
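
The invariance property behind the Geometry-Aware Encoding point can be checked on a toy example. The sketch below is not the EGNN itself: it only shows that inter-atomic Euclidean distances, the kind of edge feature the structure graph is described as using, do not change when a structure is rotated and translated. The coordinates and the helper names are made up for illustration, and only a single rotation about the z-axis is tested.

```python
import math

def pairwise_distances(coords):
    """Matrix of Euclidean distances between all pairs of atoms."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def rotate_z_and_translate(coords, angle, shift):
    """Rotate every atom about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]

# Hypothetical coordinates for four atoms (arbitrary units).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3), (0.2, 2.0, -1.0)]
moved = rotate_z_and_translate(atoms, angle=1.1, shift=(5.0, -3.0, 2.0))

d_before = pairwise_distances(atoms)
d_after = pairwise_distances(moved)

# Largest change in any pairwise distance; only floating-point noise is expected.
max_diff = max(abs(p - q)
               for row_p, row_q in zip(d_before, d_after)
               for p, q in zip(row_p, row_q))
# The raw coordinates themselves do change.
coords_changed = any(math.dist(a, m) > 1e-6 for a, m in zip(atoms, moved))
print(max_diff, coords_changed)
```

A model that consumes raw coordinates would see `atoms` and `moved` as different inputs, whereas a model built on quantities like these distances sees the same input twice. That is the intuition behind requiring invariance in the encoder.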

4. Results

The authors evaluated CLASP on a dataset of 35,911 unique UniProt proteins linked to PDB structures.

A. Zero-Shot Cross-Modal Alignment

Sequence-Structure Alignment: CLASP achieved an AUROC of 0.976 and AUPRC of 0.977, significantly outperforming baselines like Progres-CLIP (AUROC 0.919) and ProstT5.
Description-Structure Alignment: CLASP achieved an AUROC of 0.858, outperforming models using only structure encoders (Progres, COLLAPSE) paired with text.
Sequence-Description Alignment: CLASP achieved a mean F1 score of 0.932, slightly outperforming ProteinCLIP and ProtST, which suggests that structural constraints help regularize the sequence-text relationship.
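
As a rough illustration of how an AUROC like those above can be obtained from a shared embedding space, the sketch below scores matching and non-matching pairs by cosine similarity and computes AUROC as the probability that a random matching pair outscores a random non-matching pair. How the paper actually constructs its negative pairs is not stated in this summary, so using every cross-protein pair as a negative is an assumption, and the embeddings are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def auroc(pos_scores, neg_scores):
    """AUROC via pairwise comparison: P(positive score > negative score), ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy structure and sequence embeddings for three proteins (same row = same protein).
structure = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
sequence = [[0.9, 0.2], [0.0, 1.0], [-0.8, 0.1]]

# Matching pairs are positives; every cross-protein pair is treated as a negative.
pos = [cosine(structure[i], sequence[i]) for i in range(3)]
neg = [cosine(structure[i], sequence[j])
       for i in range(3) for j in range(3) if i != j]

score = auroc(pos, neg)
print(score)
```

The pairwise form is quadratic in the number of scores, which is fine for a sketch; a rank-based computation would be the usual choice at the scale of tens of thousands of proteins.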

B. Sequence Retrieval

Given a textual description, CLASP successfully retrieved the correct amino acid sequence from a pool of 35,911 candidates.
Robustness: The model maintained high performance across different description styles:
- UniProt entries: ~99.85th percentile rank.
- Literature-style: ~99.58th percentile rank.
- Freehand (expert-written): ~98.90th percentile rank.
This indicates the model generalizes well to linguistic variations not seen during training.
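
The percentile ranks above can be read with the help of a small sketch. It assumes one plausible definition, namely the share of the other candidates that score below the correct sequence under cosine similarity; the paper's exact definition is not given in this summary, and the function name and vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def percentile_rank(query, candidates, true_index):
    """Percentage of the other candidates that score strictly below the true match.

    100.0 means the correct candidate is ranked first; 0.0 means it is ranked last.
    """
    scores = [cosine(query, c) for c in candidates]
    true_score = scores[true_index]
    outranked = sum(1 for s in scores if s < true_score)
    return 100.0 * outranked / (len(scores) - 1)

# Toy text-query embedding and five candidate sequence embeddings.
query = [0.6, 0.8]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0], [0.0, -1.0]]

best = percentile_rank(query, candidates, true_index=2)    # closest candidate
second = percentile_rank(query, candidates, true_index=1)  # second-closest candidate
print(best, second)
```

Under this definition, a 99.85th percentile rank in a pool of 35,911 candidates would place the correct sequence roughly 54 positions from the top on average (0.15% of 35,910), so a high percentile does not necessarily mean the correct sequence is always the first hit.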

C. Clustering and Biological Meaning

Family Clustering: CLASP's embeddings clustered proteins by family (Kinases, GPCRs, Ion Channels, etc.) significantly better than baselines (Progres, ProstT5, COLLAPSE) across multiple metrics (Silhouette score, Calinski-Harabasz index, KL divergence).
Ablation Studies:
- Replacing EGNN with a standard GNN caused a >15 point drop in MCC, highlighting the necessity of E(3) invariance.
- Removing any single modality (training as bimodal) led to marked performance drops, confirming the synergistic value of the tri-modal approach.
- Removing explicit protein names from text descriptions resulted in only modest performance decreases, suggesting that the model learns semantic relationships rather than relying on lexical matching of names.
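
For readers unfamiliar with the clustering metrics named under Family Clustering, the sketch below computes a silhouette score by hand on toy 2D points. It is a generic textbook computation, not the authors' evaluation code; the points and family labels are invented, and it assumes at least two clusters are present.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance.

    For each point: a = mean distance to its own cluster, b = mean distance to the
    nearest other cluster, s = (b - a) / max(a, b). Requires at least two clusters.
    """
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by label.
        by_label = {}
        for j, q in enumerate(points):
            if i != j:
                by_label.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_label.get(labels[i])
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for lab, d in by_label.items() if lab != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated groups of toy embeddings.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = ["kinase", "kinase", "gpcr", "gpcr"]  # labels agree with the geometry
bad_labels = ["kinase", "gpcr", "kinase", "gpcr"]   # labels cut across the geometry

good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(good, bad)
```

Scores near 1 mean each protein sits much closer to its own family than to any other, while scores near or below 0 mean families overlap in the embedding space. This is the sense in which a higher silhouette score indicates better family clustering.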

5. Significance and Impact

Unified Representation: CLASP establishes a biologically grounded embedding space that unifies molecular geometry, evolutionary sequence data, and semantic functional knowledge.
Interpretability: The model's ability to cluster by protein family and retrieve sequences from diverse text descriptions suggests it captures high-level biological concepts (e.g., function, mechanism) rather than just surface-level patterns.
Applications: The framework opens new avenues for:
- Protein Annotation: Automatically inferring function from structure or sequence.
- Drug Discovery: Retrieving proteins with specific structural or functional properties from text queries.
- Literature Synthesis: Bridging the gap between unstructured scientific literature and structured molecular data.
Future Directions: The authors suggest extending the framework to include evolutionary context, tissue-specific expression, and conditional generation (synthesizing structures from text prompts).

In conclusion, CLASP demonstrates that integrating structural, sequential, and textual modalities via contrastive learning creates a powerful, general-purpose tool for protein understanding, outperforming state-of-the-art single and bimodal models in zero-shot tasks.