Original authors: Mingqing Wang, Zhiwei Nie, Athanasios V. Vasilakos, Yonghong He, Zhixiang Ren

Published 2026-05-26


Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Problem: The "Smoothie" of Protein Data

Imagine you have a protein. In the world of computer science, we often try to turn a protein's 3D shape into a list of numbers (a "vector" or "embedding") so a computer can understand it.

Currently, most advanced AI models do this by blending everything about the protein into one giant, messy smoothie.

Is it flexible? Yes.
Is it hydrophobic (water-repelling)? Yes.
Is it curved? Yes.
Is it stable? Yes.

The AI puts all these facts into a single cup. While the AI can use this smoothie to guess what the protein does, it's hard to know why it made that guess. It's like tasting a smoothie and knowing it has fruit in it, but not being able to tell which specific fruit is which. This makes it hard for scientists to trust the AI or understand the specific rules of biology.

The Solution: ProtDiS (The "Ingredient Separator")

The authors created a new tool called ProtDiS. Think of ProtDiS not as a blender, but as a high-tech ingredient separator.

Instead of keeping the protein data as one big smoothie, ProtDiS takes that messy data and sorts it into eight distinct, labeled jars plus one "leftover" jar. Each jar is designed to hold only one specific type of information:

Shape Jar: Holds only information about the protein's shape (like if it's a helix or a sheet).
Exposure Jar: Holds only information about how much of the protein is touching water.
Flexibility Jar: Holds only information about how much the protein wiggles.
Packing Jar: Holds only information about how tightly the atoms are packed together.
Hydrophobicity Jar: Holds only water-repelling data.
Stability Jar: Holds data on the protein's hydrogen bonding, used as a stand-in for how stable the local structure is.
Complexity Jar: Holds data on how tangled the local area is.
Curvature Jar: Holds data on how bent the structure is.
The "Leftover" Jar: This is a special catch-all for any weird structural information that doesn't fit neatly into the other eight jars.

How It Works: The "Strict Librarian"

The paper uses a concept called the Information Bottleneck. Imagine a strict librarian (the AI) who is trying to organize a chaotic library.

The Goal: The librarian wants to make sure that if you ask for the "Flexibility" book, you get only flexibility facts, and no "Shape" or "Stability" facts mixed in.
The Method: The AI is trained with a set of rules (knowledge-guided). It is told: "You must predict the 'Flexibility' of the protein using only the Flexibility Jar. If you accidentally sneak in 'Shape' data, you get a penalty."
The Result: The AI learns to force the data into these separate jars. It learns to compress the information so that each jar is efficient and independent.

Why This Matters: Finding the Needle in the Haystack

The paper claims that this separation improves the AI's predictions on specific tasks, especially when proteins look very similar but do different jobs.

The Analogy: The Twin Brothers
Imagine two identical twins (proteins with the same shape/fold).

Old AI: Sees they look identical and assumes they do the exact same job. It gets confused when one is a doctor and the other is a chef.
ProtDiS: Looks into the specific jars. It sees that while their "Shape" jar is identical, the "Flexibility" jar and the "Packing" jar are slightly different. These tiny differences are the secret keys that tell the AI, "Ah, this one is a doctor, and that one is a chef."

The Results: What the Paper Found

Better at "Hard" Tests: When the researchers tested the AI on proteins whose structures were unlike those it saw during training (a "structure-based split"), ProtDiS beat the older models, and this is where its gains were largest. In a separate test, it could also tell apart proteins that look alike but function differently.
Clearer Explanations: Because the data is in separate jars, scientists can now look at the "Flexibility Jar" and say, "The AI made this decision because the protein is very flexible," rather than guessing.
No Information Lost: The "Leftover" jar is there so that separating the data does not mean throwing anything away. Taken together, the jars are meant to hold enough information to rebuild the original protein embedding.

Summary

ProtDiS is a new way of teaching computers to understand proteins. Instead of giving the computer a blurry, mixed-up photo of a protein, it gives the computer a set of clear, labeled X-rays, each showing a different specific feature (like shape, flexibility, or stability). This allows the computer to make better predictions and helps scientists understand why a protein works the way it does, especially when proteins look very similar on the surface but act very differently underneath.

Technical Summary: Learning Protein Structure-Function Relationships through Knowledge-guided Representation Decomposition

1. Problem Statement

Proteins encode diverse biological functions within complex three-dimensional structures. While recent deep learning models (e.g., ESM-3, AlphaFold) have achieved breakthroughs in structure prediction and generative design, their internal representations remain highly entangled. These latent spaces intermix geometric, physicochemical, and topological signals, obscuring the specific biophysical principles that underlie function. Consequently, it is difficult to interpret these models mechanistically or to generalize them across proteins with similar folds but divergent functions. Existing efforts to impose structure via external constraints often fail to fully disentangle these signals or lack explicit alignment with biologically grounded properties.

The core challenge addressed is how to transform entangled structural embeddings into disentangled, knowledge-aligned factors that selectively preserve functionally relevant information while remaining interpretable and robust to structural perturbations.

2. Methodology: ProtDiS

The authors propose ProtDiS, a knowledge-guided representation factorization framework that decomposes pretrained protein micro-environment embeddings into multiple interpretable knowledge channels.

2.1. Core Framework

ProtDiS takes a high-dimensional structural embedding $s$ (e.g., from ESM-3) and decomposes it into:

Knowledge Channels ($Z_1, \dots, Z_K$): A set of latent representations, each explicitly aligned with a predefined biophysical or geometric attribute (e.g., secondary structure, packing density, flexibility).
Residual Channel ($Z_c$): An additional channel capturing structural information not explained by the predefined knowledge axes, ensuring information completeness.
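
To make the shapes concrete, here is a minimal sketch of the decomposition step: one embedding goes in, and eight knowledge channels plus one residual channel come out. Everything in it is an assumption for illustration: the channel names, the sizes, and the use of a plain linear projection per channel (with random weights standing in for trained ones). The paper's actual encoder architecture is not described here.

```python
import random

# Hypothetical channel names mirroring the eight knowledge dimensions.
KNOWLEDGE_CHANNELS = [
    "secondary_structure", "solvent_accessibility", "flexibility",
    "packing_density", "hydrophobicity", "stability", "complexity", "curvature",
]

def make_linear_head(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained projection head."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def project(weights, s):
    """Matrix-vector product: turns the embedding s into one channel vector."""
    return [sum(w * x for w, x in zip(row, s)) for row in weights]

def decompose(s, heads):
    """Map one embedding s to a dict of channel name -> channel vector."""
    return {name: project(w, s) for name, w in heads.items()}

rng = random.Random(0)
embed_dim, channel_dim = 16, 4  # illustrative sizes, not taken from the paper
heads = {
    name: make_linear_head(embed_dim, channel_dim, rng)
    for name in KNOWLEDGE_CHANNELS + ["residual"]
}
s = [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]  # stand-in for an ESM-3 embedding
channels = decompose(s, heads)
```

Keeping each channel as its own named vector is what lets a downstream model, or a person, inspect a single property in isolation.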

2.2. Theoretical Foundation

The framework is grounded in the Information Bottleneck (IB) principle. The goal is to learn representations that maximize mutual information with a specific knowledge signal while minimizing redundant information shared with other channels.

Knowledge Alignment: Each channel $Z_k$ is trained to maximize $I(Z_k; Y_k)$, where $Y_k$ is the target biophysical property.
Redundancy Reduction: The framework encourages functional disentanglement by minimizing the mutual information between the residual channel and all knowledge variables, and by penalizing correlations between different knowledge channels.
Completeness: The joint representation $(Z_1, \dots, Z_K, Z_c)$ must be sufficient to reconstruct the original embedding $s$.
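
One way to write these three requirements as a single objective is sketched below. This is a reading of the description above in standard Information Bottleneck notation, not the paper's exact formula: the trade-off weights $\beta$, $\lambda$, and $\gamma$ are assumed symbols, and $\mathrm{corr}(\cdot,\cdot)$ stands in for whichever correlation penalty the authors use.

$$
\max \; \sum_{k=1}^{K} \Big[ I(Z_k; Y_k) - \beta \, I(Z_k; s) \Big] \; - \; \lambda \sum_{k=1}^{K} I(Z_c; Y_k) \; - \; \gamma \sum_{k \neq l} \mathrm{corr}(Z_k, Z_l)
$$

$$
\text{subject to } (Z_1, \dots, Z_K, Z_c) \text{ being sufficient to reconstruct } s .
$$

The first sum rewards each channel for being informative about its own property while staying compact, the second keeps knowledge out of the residual channel, and the third discourages overlap between knowledge channels.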

2.3. Optimization Objectives

The training objective combines several loss components:

Knowledge Supervision Loss: Minimizes prediction error for each knowledge variable $Y_k$ using a supervised head.
Bottleneck Regularization: Uses KL-divergence to constrain the information capacity of each channel, encouraging compact representations.
Reconstruction Loss: Ensures the residual channel retains complementary information necessary to reconstruct the original structural embedding.
Adversarial Knowledge Removal: Employs a gradient reversal layer to prevent the residual channel $Z_c$ from predicting any of the predefined knowledge variables $Y_k$, enforcing invariance.
Redundancy Reduction Loss: Penalizes the cross-correlation between different knowledge channels to ensure they capture complementary, non-overlapping information.
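
The sketch below shows how several of these terms could be computed and combined. It rests on assumptions: each channel is treated as a diagonal Gaussian (so the KL term has a closed form), the redundancy penalty is a sum of squared cross-correlations, and the loss weights are made-up values. The adversarial term is left out because a gradient reversal layer needs an autograd framework.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one channel."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def mse(pred, target):
    """Mean squared error, used here for supervision and reconstruction."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def standardize_columns(batch):
    """Return the columns of a batch (n rows x d dims), each with zero mean and unit std."""
    n, d = len(batch), len(batch[0])
    cols = []
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        cols.append([(v - mean) / std if std > 0 else 0.0 for v in col])
    return cols

def cross_correlation_penalty(batch_a, batch_b):
    """Sum of squared correlations between every dim of channel A and every dim of channel B."""
    cols_a, cols_b = standardize_columns(batch_a), standardize_columns(batch_b)
    n = len(batch_a)
    return sum(
        (sum(x * y for x, y in zip(ca, cb)) / n) ** 2
        for ca in cols_a for cb in cols_b
    )

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

# Toy values only; the weights are hypothetical, not the paper's settings.
terms = {
    "knowledge": mse([0.8], [1.0]),
    "bottleneck": kl_to_standard_normal([0.1, -0.2], [0.0, 0.0]),
    "reconstruction": mse([0.5, 0.5], [0.4, 0.7]),
    "redundancy": cross_correlation_penalty(
        [[1.0], [2.0], [3.0]], [[1.0], [-2.0], [1.0]]
    ),
}
weights = {"knowledge": 1.0, "bottleneck": 0.01, "reconstruction": 1.0, "redundancy": 0.1}
loss = total_loss(terms, weights)
```

In practice the weights decide how hard the model is pushed toward compact, non-overlapping channels versus accurate prediction and reconstruction.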

2.4. Knowledge Dimensions

The study focuses on eight core knowledge dimensions derived from protein structures:

Secondary Structure (SS)
Solvent Accessibility (ASA)
Flexibility (B-factor)
Packing Density (Weighted Contact Number)
Hydrophobicity (Kyte-Doolittle scale)
Stability (Hydrogen bond statistics)
Complexity (Contact entropy)
Curvature (Ollivier-Ricci Curvature)
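
Two of these labels are simple enough to sketch directly. The code below looks up the published Kyte-Doolittle hydropathy value for each residue and computes a weighted contact number using the common inverse-squared-distance definition. How the paper turns these into training labels (which atoms it uses, any windowing or normalization) is not stated here, so treat both functions as illustrative assumptions.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy(sequence):
    """Per-residue hydropathy; higher values mean more hydrophobic."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence]

def weighted_contact_number(coords):
    """Per-residue sum of 1 / d^2 over all other residues (assumed WCN definition).

    coords holds one (x, y, z) point per residue, e.g. C-alpha positions.
    """
    wcn = []
    for i, a in enumerate(coords):
        total = 0.0
        for j, b in enumerate(coords):
            if i != j:
                total += 1.0 / sum((x - y) ** 2 for x, y in zip(a, b))
        wcn.append(total)
    return wcn

# Toy inputs: a three-residue peptide laid out on a straight line.
labels_hydro = hydropathy("AIR")
labels_wcn = weighted_contact_number([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
```

The middle residue gets the highest contact number because it sits closest to everything else, which is the intuition behind using this quantity as a packing-density label.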

3. Key Results

3.1. Representation Decoupling

Feature-level analyses confirm that ProtDiS produces channels that are:

Specific: Each channel exhibits high mutual information with its target label and low information with others (diagonal structure in MI heatmaps).
Independent: Pairwise distance correlation coefficients between different knowledge channels are consistently low, indicating effective disentanglement.
Complete: Progressive reconstruction experiments show that adding channels monotonically decreases reconstruction loss, confirming that the set of channels collectively preserves the full information content of the original embedding.
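
The independence check relies on distance correlation, which is zero only when two variables are statistically independent and so catches nonlinear dependence that ordinary correlation misses. A minimal sketch of the standard sample estimator follows; the inputs are toy vectors, and the paper's exact evaluation protocol is not specified here.

```python
import math

def _centered_distances(x):
    """Pairwise Euclidean distance matrix of x, double-centered."""
    n = len(x)
    d = [[math.dist(a, b) for b in x] for a in x]
    row_mean = [sum(r) / n for r in d]  # equals the column means, since d is symmetric
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]

def distance_correlation(x, y):
    """Sample distance correlation between two lists of equal-length samples.

    x and y are lists of vectors (one vector per sample). Returns a value in [0, 1].
    """
    a, b = _centered_distances(x), _centered_distances(y)
    n = len(x)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

# Toy check: a channel compared with a linear function of itself.
channel_a = [[0.0], [1.0], [2.0], [3.0]]
channel_b = [[2.0 * v[0] + 1.0] for v in channel_a]
dcor_dependent = distance_correlation(channel_a, channel_b)
```

For well-disentangled channels, the value computed between any two different knowledge channels should sit close to zero.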

3.2. Downstream Performance

ProtDiS was evaluated on twelve downstream tasks (including enzyme classification, Gene Ontology prediction, SCOP classification, and ligand binding) under two split schemes: random and structure-based.

Structure-Based Splits: The most significant improvements were observed under structure-based splits (where test proteins have low structural similarity to training data). For example, enzyme class (EC) prediction accuracy increased by 6.05%, and ligand-binding affinity prediction improved by 4.45% compared to the baseline ESM-3 Structural Tokenizer.
Generalization: The model consistently outperformed baselines across most tasks, demonstrating that disentangled, knowledge-aligned features generalize better to unseen structural folds.
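
A structure-based split can be sketched as a split over structural clusters rather than over individual proteins, so that no cluster contributes to both training and test sets. The sketch assumes cluster IDs (for example, fold labels) have already been computed; the protein and fold names are hypothetical, and the paper's actual splitting procedure is not described here.

```python
import random

def structure_based_split(protein_to_cluster, test_fraction=0.2, seed=0):
    """Split proteins so that train and test share no structural cluster."""
    clusters = sorted(set(protein_to_cluster.values()))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train = [p for p, c in protein_to_cluster.items() if c not in test_clusters]
    test = [p for p, c in protein_to_cluster.items() if c in test_clusters]
    return train, test

# Hypothetical proteins grouped into three hypothetical folds.
mapping = {
    "prot_a": "fold_1", "prot_b": "fold_1",
    "prot_c": "fold_2",
    "prot_d": "fold_3", "prot_e": "fold_3",
}
train_ids, test_ids = structure_based_split(mapping, test_fraction=0.34)
```

A random split, by contrast, would shuffle proteins directly and could place two members of the same fold on opposite sides, which makes the test easier.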

3.3. Differentiation of Structurally Similar Proteins

A critical finding is ProtDiS's ability to distinguish proteins with similar global folds but divergent functions.

High TM-Score Discrimination: In tasks involving enzyme pairs with high structural similarity (TM-score > 0.8), knowledge-guided embeddings maintained high discriminative power (AUC 0.946), whereas standard structural embeddings degraded significantly (AUC 0.868).
Mechanistic Insight: The framework captures fine-grained biophysical variations (e.g., local packing density, contact entropy) that are obscured in global structural embeddings but are critical for functional specificity.
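
The AUC figures above measure how well an embedding separates same-function pairs from different-function pairs. The sketch below scores each pair by the distance between its two embeddings and computes AUC from its pairwise-ranking definition. The embeddings and labels are toy values, and using Euclidean distance as the pair score is an assumption rather than the paper's stated protocol.

```python
import math

def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pair_score(emb_a, emb_b):
    """Distance between two protein embeddings; larger suggests different function."""
    return math.dist(emb_a, emb_b)

# Toy pairs of structurally similar proteins: (embedding A, embedding B, label),
# where label 1 means the two proteins have different functions.
pairs = [
    ([0.0, 0.0], [0.9, 0.1], 1),
    ([1.0, 1.0], [1.0, 1.8], 1),
    ([0.5, 0.5], [0.6, 0.5], 0),
    ([2.0, 0.0], [2.0, 0.2], 0),
]
scores = [pair_score(a, b) for a, b, _ in pairs]
labels = [label for _, _, label in pairs]
auc = roc_auc(scores, labels)
```

An AUC of 0.5 would mean the embedding distance carries no information about functional difference, while 1.0 means every different-function pair is farther apart than every same-function pair.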

4. Significance and Claims

The paper claims that ProtDiS provides a general and interpretable approach for structuring latent spaces in protein structural modeling. Its significance lies in:

Interpretability: By disentangling representations into biologically grounded dimensions, the model offers a "computational lens" to explore structure-function relationships that were previously hidden in entangled deep embeddings.
Robustness: The framework learns representations that are less sensitive to superficial fold similarity and more aligned with functionally relevant mechanisms, explaining its superior performance in out-of-distribution (structure-based) scenarios.
Causal Alignment: The learned channels approximate mechanism-level variables governing protein function, allowing for the study of causal links between structural knowledge components and function.

The authors position this work as a step toward mechanistic biology, suggesting that knowledge-guided decomposition can drive discovery in protein science by making latent spaces transparent and manipulable. They acknowledge current limitations, specifically the reliance on 3D structural data, and propose future work extending these principles to sequence and evolutionary feature spaces.
