How Not to be Seen: Predicting Unseen Enzyme Functions… — Plain-Language Explanation

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Problem: The "Unseen" Enzyme

Imagine you are a librarian trying to organize a massive, chaotic library of books (proteins). You have a perfect catalog system called the EC System, which sorts books into four levels of detail:

Genre (e.g., Mystery)
Sub-genre (e.g., Detective)
Plot Type (e.g., Whodunit)
Specific Title (e.g., The Case of the Missing Cat)

For most books, you know the exact title (Level 4). But in the world of biology, we are discovering millions of new "books" (enzymes) every day. The problem? We often don't know the specific title. We've never seen this exact book before.

If you try to force a new book into the "Specific Title" slot, you might guess wrong. But you can usually figure out the Genre, Sub-genre, and Plot Type (Levels 1–3). Knowing it's a "Detective Mystery" is still incredibly helpful, even if you don't know the exact title yet. It tells you what kind of story to expect.

The Goal: The scientists wanted to build a computer program that could take a brand-new, unknown enzyme and say, "I don't know the exact title, but I'm 90% sure it's a 'Phosphodiesterase' (a specific type of chemical cutter)."

The Old Way vs. The New Way

The Old Way: "The Look-Alike" (BLAST)

Traditionally, scientists used a method like BLAST. Imagine you have a new book, and you look at the cover. If the cover looks 90% like a book you already have, you assume they are the same story.

The Flaw: This works great if the books look very similar. But if the new book has a slightly different cover design (low sequence similarity), the old method gets confused and might put a "Mystery" book into the "Cookbook" section just because the font looked similar.

The New Way: "EnzPlacer" (The Smart Organizer)

The authors created a new tool called EnzPlacer. Instead of just looking at the cover, it learns the vibe and structure of the stories.

They used a technique called Contrastive Learning. Think of this as a game of "Hot and Cold" in a giant room:

The Training: The computer is shown thousands of books. It learns that all "Detective Mysteries" should stand close together in the room, and all "Cookbooks" should stand far away.
The Hierarchy Trick (HiNCE): This is the secret sauce. Standard methods just say, "These two books are the same." But EnzPlacer is smarter. It learns the family tree.
- It knows that even if two books have different titles, if they are both "Detective Mysteries," they should still stand near each other.
- It learns that "Detective Mysteries" and "Spy Thrillers" are siblings, so they should be in the same aisle, even if they aren't the exact same book.

How It Works (The "Magic" Step)

The paper introduces a method called HiNCE (Hierarchical Exemplar Contrastive Objective).

Imagine a dance floor: The computer puts all the enzymes on a dance floor.
The Goal: It wants to group them by family.
The Twist: It doesn't just group them by exact match. It creates "Centroids" (imaginary dance captains) for every level of the family tree.
- There is a captain for "Hydrolases" (Level 1).
- A captain for "Esterases" (Level 2), the hydrolases that cut ester bonds.
- A captain for "Phosphodiesterases" (Level 3).
When a new, unknown enzyme walks in, the computer asks: "Which captains does this dancer vibe with?" Even if the dancer doesn't know the specific title (Level 4), they might naturally gravitate toward the "Phosphodiesterase" captain.

The Results: Why It Matters

The scientists tested this on a "hard mode" dataset:

The "Unseen" Test: They gave the computer enzymes it had never seen before, with titles it had never learned.
The Result:
- The old "Look-Alike" method (BLAST) fell apart when the enzymes didn't look very similar. It got lost.
- EnzPlacer kept its cool. Even when the enzymes were strangers, EnzPlacer could still say, "Hey, this one belongs in the 'Phosphodiesterase' family!"
- It was especially good at predicting the Level 3 category (the "Plot Type"), which is the sweet spot for helping scientists design experiments.

A Real-Life Example

The paper mentions a specific enzyme (Protein A0A1D8PNZ7).

The Reality: It's a "Phosphodiesterase" (it cuts specific chemical bonds).
The Old Method: Looked at the sequence, got confused, and said, "This looks like a Kinase" (a totally different type of enzyme that attaches phosphate groups to other molecules instead of cutting bonds). This is a huge mistake!
EnzPlacer: Looked at the "vibe" and the family structure, and correctly said, "This is a Phosphodiesterase."

The Takeaway

EnzPlacer is like a super-smart librarian who doesn't need to know the exact title of a book to know where it belongs on the shelf.

In a world where we are discovering new biological "books" faster than we can read them, this tool helps scientists narrow down the search. Instead of guessing blindly, they can say, "We don't know the exact function, but we know it's a 'Chemical Cutter,' so let's test it with that specific chemical."

It turns a wild guess into a smart, educated hypothesis, saving time and money in the lab.

1. Problem Statement

The prediction of enzyme function from amino acid sequences remains a significant challenge in computational biology. While high-throughput sequencing generates vast amounts of genomic data, fewer than 0.1% of protein sequences are experimentally annotated.

The Core Challenge: Most existing models operate under an "in-distribution" assumption, where test labels exist in the training set. However, real-world scenarios often involve unseen enzyme functions (novel EC4 serial numbers) that have no exact match in training data.
The Goal: Instead of failing to assign a non-existent label, the goal is to accurately place an unseen enzyme into the correct functional neighborhood (i.e., predicting the correct EC1, EC2, and EC3 levels) even when the specific EC4 label is unknown. This narrows the search space for experimentalists.
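The idea of a "functional neighborhood" can be made concrete by treating each EC number as a list of nested prefixes. The following is a minimal sketch, not code from the paper: the helper names `ec_prefixes` and `deepest_shared_level` are hypothetical, and the EC numbers are illustrative examples only.

```python
def ec_prefixes(ec_number: str) -> list[str]:
    """Return the EC1..EC4 prefixes of a four-part EC number."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    return [".".join(parts[:level]) for level in range(1, 5)]


def deepest_shared_level(ec_a: str, ec_b: str) -> int:
    """Count the leading EC levels (0 to 4) on which two EC numbers agree."""
    shared = 0
    for prefix_a, prefix_b in zip(ec_prefixes(ec_a), ec_prefixes(ec_b)):
        if prefix_a != prefix_b:
            break
        shared += 1
    return shared


print(ec_prefixes("3.1.4.17"))                       # ['3', '3.1', '3.1.4', '3.1.4.17']
print(deepest_shared_level("3.1.4.17", "3.1.4.53"))  # 3
print(deepest_shared_level("3.1.4.17", "2.7.11.1"))  # 0
```

Under this view, a prediction for an unseen EC4 counts as a correct placement when it shares at least three levels with the true label.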

2. Methodology: EnzPlacer

The authors propose EnzPlacer, a contrastive learning framework designed to learn a representation space that respects the hierarchical structure of the Enzyme Commission (EC) classification system.

A. Data Curation and Splitting Strategy

Dataset: 183,613 unique protein sequences derived from the ExPASy ENZYME database, mapped to UniProtKB.
Unseen Split (Out-of-Distribution):
- Strategy: The dataset is split at the EC4 level (specific enzyme activity).
- Training: Contains the most abundant EC4 groups within each EC3 family (established classes).
- Testing: Contains rare EC4 groups within the same EC3 families.
- Constraint: Test EC4 labels are never seen during training, but the parent EC1–EC3 families are present.
- Validation: A subset of 9,901 proteins with experimentally verified annotations (excluding homology-inferred labels) was used to ensure ground-truth reliability.
- Hard Negatives: Additional subsets were created by filtering out test proteins with >50%, >30%, and >10% sequence identity to training proteins to test generalization under low homology.
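The splitting strategy above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the parameter `train_classes_per_ec3` (how many of the most abundant EC4 classes per EC3 family go to training) is a hypothetical knob, the protein IDs and identity values are toy data, and sequence identities are assumed to be precomputed by some alignment tool.

```python
from collections import Counter, defaultdict


def unseen_ec4_split(labels, train_classes_per_ec3=1):
    """Split protein IDs so that test EC4 labels never occur in training.

    `labels` maps a protein ID to its four-part EC number. Within each EC3
    family the most abundant EC4 classes go to training; the rarer EC4
    classes of the same family go to testing.
    """
    ec4_counts = Counter(labels.values())
    ec4_by_ec3 = defaultdict(set)
    for ec4 in ec4_counts:
        ec3 = ec4.rsplit(".", 1)[0]  # drop the fourth field
        ec4_by_ec3[ec3].add(ec4)

    train_ec4 = set()
    for classes in ec4_by_ec3.values():
        # Sort by descending abundance; break ties by label for determinism.
        ranked = sorted(classes, key=lambda c: (-ec4_counts[c], c))
        train_ec4.update(ranked[:train_classes_per_ec3])

    train_ids = sorted(p for p, ec in labels.items() if ec in train_ec4)
    test_ids = sorted(p for p, ec in labels.items() if ec not in train_ec4)
    return train_ids, test_ids


def filter_low_identity(test_ids, max_identity_to_train, threshold):
    """Keep test proteins whose best identity to any training protein is <= threshold."""
    return [p for p in test_ids if max_identity_to_train[p] <= threshold]


toy_labels = {
    "p1": "3.1.4.17", "p2": "3.1.4.17", "p3": "3.1.4.17", "p4": "3.1.4.53",
    "p5": "2.7.11.1", "p6": "2.7.11.1", "p7": "2.7.11.22",
}
train_ids, test_ids = unseen_ec4_split(toy_labels)
print(train_ids)  # ['p1', 'p2', 'p3', 'p5', 'p6']
print(test_ids)   # ['p4', 'p7']

toy_identity = {"p4": 0.42, "p7": 0.08}  # made-up values for illustration
print(filter_low_identity(test_ids, toy_identity, threshold=0.10))  # ['p7']
```

Note that an EC3 family with only one EC4 class contributes nothing to the test set in this sketch, which keeps the constraint that every test protein has its parent EC1–EC3 family in training.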

B. Model Architecture

Input: Fixed-length protein embeddings generated by the pre-trained ESM-1b (Evolutionary Scale Modeling) encoder. The encoder weights are frozen.
Projection Head: A lightweight Multi-Layer Perceptron (MLP) maps the ESM embeddings ( $h(x)$ ) to a task-specific representation space ( $z(x)$ ).
Loss Function: Hierarchical Exemplar Contrastive Objective (HiNCE)
The core innovation is a composite loss function that combines:
1. Instance-Level Supervised Contrastive Loss ( $L_{inst}$ ): Pulls embeddings of proteins with the same EC4 label together and pushes different labels apart.
2. Hierarchical Exemplar Loss ( $L_{exem}$ ): Explicitly enforces the EC hierarchy.
  - For each protein, the model calculates centroids for its EC1, EC2, EC3, and EC4 prefixes.
  - The loss encourages the protein embedding to align with the centroids of its own hierarchy levels (e.g., an enzyme labeled 1.2.3.4 should be close to the centroids of 1, 1.2, 1.2.3, and 1.2.3.4).
  - Leave-One-Out (LOO): To prevent trivial self-influence in small classes, the centroid calculation excludes the anchor protein itself.
- Hard Negative Mining: The model periodically updates a distance map to select "hard negatives" (proteins that are currently similar in the embedding space but have different labels) to improve discrimination.
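To make the projection head and the exemplar term more tangible, here is a minimal NumPy sketch. It is a reading of the description above, not the authors' implementation: the two-layer ReLU head, the toy dimensions, the temperature value, and the choice of a softmax cross-entropy over per-level centroids are all assumptions, and the instance-level term and hard negative mining are left out for brevity.

```python
import numpy as np


def project(h, w1, b1, w2, b2):
    """Two-layer MLP head mapping frozen-encoder embeddings h(x) to unit-norm z(x)."""
    hidden = np.maximum(h @ w1 + b1, 0.0)  # ReLU
    z = hidden @ w2 + b2
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def exemplar_loss(z, ec_labels, temperature=0.1):
    """Average, over anchors and EC levels, of a cross-entropy that pulls each
    anchor toward the leave-one-out centroid of its own prefix at that level."""
    total, terms = 0.0, 0
    for level in range(1, 5):
        prefixes = np.array([".".join(ec.split(".")[:level]) for ec in ec_labels])
        classes = np.unique(prefixes)
        if len(classes) < 2:
            continue  # nothing to contrast against at this level
        for i in range(len(z)):
            centroids, target = [], None
            for c in classes:
                members = prefixes == c
                members[i] = False  # leave-one-out: the anchor never counts itself
                if not members.any():
                    continue
                if c == prefixes[i]:
                    target = len(centroids)
                centroids.append(z[members].mean(axis=0))
            if target is None or len(centroids) < 2:
                continue  # singleton class, or no competing centroid
            logits = np.stack(centroids) @ z[i] / temperature
            shifted = logits - logits.max()  # numerically stable log-softmax
            log_prob = shifted[target] - np.log(np.exp(shifted).sum())
            total -= log_prob
            terms += 1
    return total / terms if terms else 0.0


rng = np.random.default_rng(0)
labels = ["3.1.4.17", "3.1.4.17", "3.1.4.53", "2.7.11.1", "2.7.11.1", "2.7.11.22"]
h = rng.normal(size=(6, 16))  # toy stand-in for frozen encoder embeddings
w1, b1 = rng.normal(size=(16, 32)) * 0.1, np.zeros(32)
w2, b2 = rng.normal(size=(32, 8)) * 0.1, np.zeros(8)
z = project(h, w1, b1, w2, b2)
print(exemplar_loss(z, labels))
```

Because singleton classes have no leave-one-out centroid of their own, they are skipped at that level in this sketch but still contribute at coarser levels, where they share a prefix with other proteins.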

C. Inference

At test time, a query protein is embedded, and its EC label is assigned based on the nearest neighbor in the learned embedding space (1-NN or k-NN voting). Performance is evaluated by comparing the predicted EC prefixes against the ground truth at levels 1, 2, and 3.
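The inference and evaluation steps can be sketched as follows. This is a minimal illustration with toy two-dimensional embeddings: the Euclidean distance, the tie-breaking rule, and the function names `knn_predict` and `prefix_accuracy` are assumptions rather than details taken from the paper.

```python
from collections import Counter

import numpy as np


def knn_predict(query_z, train_z, train_labels, k=1):
    """Assign each query the majority EC number among its k nearest training
    embeddings; when vote counts tie, the closest neighbour's label wins."""
    dists = np.linalg.norm(query_z[:, None, :] - train_z[None, :, :], axis=2)
    predictions = []
    for row in dists:
        nearest = np.argsort(row, kind="stable")[:k]
        votes = Counter(train_labels[j] for j in nearest)
        best = max(votes.values())
        predictions.append(
            next(train_labels[j] for j in nearest if votes[train_labels[j]] == best)
        )
    return predictions


def prefix_accuracy(predicted, truth, level):
    """Fraction of predictions that match the truth on the first `level` EC fields."""
    hits = sum(p.split(".")[:level] == t.split(".")[:level]
               for p, t in zip(predicted, truth))
    return hits / len(truth)


train_z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
train_labels = ["3.1.4.17", "3.1.4.17", "2.7.11.1"]
query_z = np.array([[0.95, 0.05], [0.1, 0.9]])
truth = ["3.1.4.53", "2.7.11.22"]  # EC4 labels absent from training

predicted = knn_predict(query_z, train_z, train_labels, k=1)
print(predicted)                             # ['3.1.4.17', '2.7.11.1']
print(prefix_accuracy(predicted, truth, 3))  # 1.0
print(prefix_accuracy(predicted, truth, 4))  # 0.0
```

The toy output shows the point of the unseen setting: the exact EC4 label can never be right, but the EC3 placement can be.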

3. Key Contributions

Novel Evaluation Protocol: The paper introduces a rigorous "Unseen-EC4" benchmark where test labels are completely absent from training, simulating the discovery of novel enzyme functions.
Hierarchical Contrastive Learning (HiNCE): A new loss function that explicitly models the directed acyclic graph (DAG) structure of the EC ontology, ensuring that proteins sharing higher-level prefixes (EC1–3) remain clustered even if their specific EC4 differs.
Robustness to Low Homology: Demonstrates that learned representations can outperform traditional homology-based methods (BLAST) even when sequence identity is extremely low (<10%).

4. Results

The model was evaluated against baselines including CLEAN (contrastive learning), GloEC (hierarchy-GCN), ProteInfer (deep CNN), and BLASTp (homology transfer).

Unseen-EC4 Performance (Experimentally Validated Set):
- EnzPlacer achieved the highest accuracy and macro-F1 scores at both EC2 and EC3 levels.
- EC2 Accuracy: EnzPlacer (0.435) vs. CLEAN (0.385) vs. BLASTp (lower).
- EC3 Accuracy: EnzPlacer (0.356) vs. CLEAN (0.310) vs. BLASTp (lower).
- Low-Similarity Regime (<10% identity): EnzPlacer maintained a significant performance margin over baselines. Notably, BLASTp performance dropped sharply in this regime, whereas EnzPlacer's decline was more gradual, which suggests that it captures functional signal beyond simple sequence identity.
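For readers unfamiliar with the two metrics reported above, the following sketch shows how accuracy and macro-F1 differ on a toy set of EC3 predictions. The data are made up, and averaging over the union of true and predicted classes is one common convention assumed here; the paper may use a different one.

```python
def accuracy(predicted, truth):
    """Fraction of exact matches."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def macro_f1(predicted, truth):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(truth) | set(predicted))
    scores = []
    for c in classes:
        tp = sum(p == c and t == c for p, t in zip(predicted, truth))
        fp = sum(p == c and t != c for p, t in zip(predicted, truth))
        fn = sum(p != c and t == c for p, t in zip(predicted, truth))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


toy_truth = ["3.1.4", "3.1.4", "2.7.11", "2.7.11"]
toy_pred = ["3.1.4", "2.7.11", "2.7.11", "2.7.11"]
print(accuracy(toy_pred, toy_truth))  # 0.75
print(macro_f1(toy_pred, toy_truth))
```

Here the per-class F1 scores are 0.8 for 2.7.11 and 2/3 for 3.1.4, so macro-F1 falls below accuracy because the under-predicted class pulls the average down.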
Case Studies:
- EnzPlacer correctly identified the EC3 family (Phosphodiesterases, 3.1.4.*) for proteins where BLAST and other baselines incorrectly assigned them to unrelated families (e.g., Kinases 2.7.11.*).
Seen-EC4 Performance (In-Distribution):
- In traditional settings where test labels exist in training, all methods performed well (EnzPlacer accuracy ~0.91) and the gap between methods narrowed, which highlights that the "Unseen" task is the true differentiator.
Visualization (t-SNE):
- Embeddings from EnzPlacer showed significantly better clustering of EC3 families (e.g., 3.1.4) compared to raw ESM embeddings, which were scattered. This suggests that the model learned hierarchy-consistent geometry.

5. Significance and Implications

Scientific Impact: The study shifts the paradigm from "predicting the exact label" (which is impossible for novel enzymes) to "predicting the functional context." Accurate EC3 prediction narrows the hypothesis space for experimentalists, guiding them toward specific reaction mechanisms (e.g., distinguishing a phosphodiesterase from a kinase).
Methodological Advance: The results indicate that hierarchy-aware contrastive learning generalizes to unseen functional classes better than the flat classification and homology transfer baselines it was compared against.
Limitations & Future Work:
- The current model handles single-label enzymes; future work must address promiscuous/multi-functional enzymes.
- It relies solely on sequence data; integrating structural and kinetic data could further improve robustness.
- The model does not yet output calibrated uncertainty, which is crucial for flagging unreliable predictions in open-set scenarios.

In conclusion, EnzPlacer provides a robust framework for placing novel enzymes into known functional spaces, offering a critical tool for the annotation of the "dark matter" of the proteome.

How Not to be Seen: Predicting Unseen Enzyme Functions using Contrastive Learning