Phase Transitions in Unsupervised Feature Selection

Original authors: Jonathan Fiorentino, Michele Monti, Dimitrios Miltiadis-Vrachnos, Vittorio Del Tatto, Alessandro Laio, Gian Gaetano Tartaglia

Published 2026-02-03

CC0 1.0

Original paper dedicated to the public domain under CC0 1.0 (http://creativecommons.org/publicdomain/zero/1.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to describe a complex object, like a human protein, to a friend. You have a massive list of 150 different facts about it: its weight, its color, how sticky it is, how it folds, how it reacts to heat, and so on. The problem is, many of these facts are redundant (saying "it's heavy" and "it has high mass" is the same thing), and some are just noise.

The researchers in this paper asked a simple question: How many of these facts do we actually need to keep to understand the protein perfectly?

To answer this, they used a mathematical tool called "Differentiable Information Imbalance" (DII). Think of DII as a smart filter that tries to figure out which facts are the most important by seeing how well a small group of facts can mimic the whole group.

Here is what they discovered, explained through a few everyday analogies:

1. The Two Types of "Fact Sets"

The team looked at two different ways of describing proteins:

Physico-chemical features: These are like a list of chemical properties (e.g., "is it oily?", "is it acidic?"). The paper found these facts are highly interconnected. If you know one, you often know the others because they come in "blocks" of related information.
Structural features: These are based on the protein's 3D shape (e.g., "how round is it?", "how many holes does it have?"). These facts are more independent and messy. They don't talk to each other as much; they are more like a random collection of unique details.

2. The "Glass" vs. The "Liquid"

The most fascinating part of the paper is how they described what happens when you start removing facts from these lists. They used concepts from physics (specifically, how materials change state) to explain the results.

For the Chemical Facts (The "Glass" Phase):
Imagine you are trying to solve a puzzle where the pieces are all slightly different shades of the same color.

When you have very few pieces (facts): The picture is blurry and chaotic. There are many different ways to arrange the few pieces you have, and they all look roughly the same (this is called a "glassy" state). It's frustrating because you can't find the right answer; there are too many "almost right" answers.
The Tipping Point: As you add just a few more pieces, suddenly the picture snaps into focus. There is a specific number of pieces where the chaos stops, and the image becomes clear.
The Result: The researchers found a "critical number" of chemical facts. Below this number, the description is messy and unreliable. Once you cross this number, the description becomes reliable, and adding more facts doesn't help much. It's like a light switch: off, then suddenly on.

For the Structural Facts (The "Liquid" Phase):
Now imagine a puzzle where every piece is a completely different shape and color.

The Process: As you add pieces, the picture gets better and better, but it never "snaps" into place. It's a smooth, gradual improvement, like pouring water into a glass. There is no sudden moment where the picture becomes perfect; it just keeps getting clearer the more you add.
The Result: There is no single "magic number" of structural facts that solves the problem. You just need to keep adding them to get better results.

3. The Magic Connection to Prediction

The paper makes a remarkable claim about the "Chemical Facts" (the Glass phase).

They tested if this "tipping point" (the critical number of facts) actually mattered for real-world tasks. They tried to use these facts to teach a computer to classify proteins (e.g., "Is this protein a liquid-liquid phase separator?").

The Discovery: The point where the "glassy" chaos ended and the picture snapped into focus coincided with the point where the computer's ability to predict the protein's function stopped improving.

Before the tipping point: The computer was confused and made mistakes.
At the tipping point: The computer suddenly became as smart as it could possibly be.
After the tipping point: Adding more facts didn't make the computer any smarter; it just wasted time.

The Bottom Line

The paper shows that for certain types of data (like chemical properties), there is a hidden "sweet spot." If you have too few facts, the data is too messy to use. If you have just enough to reach the "tipping point," you get the maximum possible insight. You don't need the whole massive list; you just need to reach that critical threshold.

For other types of data (like 3D shapes), there is no such sweet spot; you just need to keep gathering as much information as possible.

In short: The researchers found a way to use math to detect a "phase transition" in data. They showed that for chemical descriptions of proteins, there is a specific, minimal number of facts you need to know to understand the whole story, and you can find this number without ever looking at the final answer (labels) first.

Technical Summary: Phase Transitions in Unsupervised Feature Selection

Problem Statement
Identifying minimal and informative feature sets is a fundamental challenge in data analysis, particularly in regimes with limited data points. In protein classification, high-dimensional feature representations derived from sequence and structure are often redundant, strongly correlated, or noisy. While supervised feature selection methods can identify discriminative features, they require labeled data and are prone to overfitting in low-data regimes. Consequently, there is a need for robust, unsupervised criteria to determine the optimal number of features required to capture the intrinsic geometry of the data without relying on downstream task labels.

Methodology
The authors apply a theoretical framework based on the Differentiable Information Imbalance (DII) to unsupervised feature selection. The DII is an information-theoretic quantity that measures how faithfully the neighborhood structure of a reference feature space is reproduced in an input feature space. In this study, the full feature set serves as the reference, and a subset of features serves as the input.
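As a rough illustration of the quantity described above, the following NumPy sketch computes a softmax-weighted imbalance between an input space and a reference space. It assumes the commonly quoted form DII = (2/N^2) Σ_{i≠j} c_ij r_ij, where r_ij is the rank of point j among the neighbours of point i in the reference space and c_ij is a softmax over negative squared distances in the input space. The function names, the smoothing parameter `lam`, and its default heuristic are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def distance_ranks(D):
    """ranks[i, j] = rank of j among the neighbours of i (1 = nearest)."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)  # a point is never its own neighbour
    order = np.argsort(D, axis=1)
    n = D.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    return ranks  # diagonal entries get rank n and are never used


def dii(X_input, X_reference, lam=None):
    """Softmax-weighted imbalance from the input space to the reference space.

    Assumed form: (2 / N^2) * sum_{i != j} c_ij * r_ij.
    `lam` is a smoothing scale; the default below is an ad hoc heuristic.
    """
    n = X_input.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    d_in = np.where(off_diag, pairwise_sq_dists(X_input), np.inf)
    if lam is None:
        # Assumption: use the mean nearest-neighbour squared distance.
        lam = max(float(np.mean(d_in.min(axis=1))), 1e-12)
    logits = -d_in / lam
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    c = np.exp(logits)
    c /= c.sum(axis=1, keepdims=True)
    ranks_ref = distance_ranks(pairwise_sq_dists(X_reference))
    return 2.0 / n ** 2 * float(np.sum(c * ranks_ref))


# Synthetic example: the full set is the reference, as in the study design.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 6))
X_noise = rng.normal(size=(200, 6))  # unrelated to X_full

dii_self = dii(X_full, X_full)
dii_subset = dii(X_full[:, :3], X_full)
dii_noise = dii(X_noise, X_full)
print(dii_self, dii_subset, dii_noise)
```

With a very small `lam` the softmax collapses onto the nearest neighbour, so comparing the full set with itself gives 2/N, the lowest attainable value under this form. An input that is unrelated to the reference gives a value of roughly 1.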

The methodology involves:

Datasets: Four human protein datasets representing distinct functional classes: Liquid-Liquid Phase Separating (LLPS) proteins, RNA-binding proteins (RBPs), membrane proteins, and enzymes.
Feature Types: Two distinct feature sets were analyzed for each dataset:
- Physico-chemical descriptors: Sequence-derived features (82 features) capturing hydrophobicity, aggregation, disorder, and secondary structure propensities. These exhibit near-Gaussian distributions and strong block-wise correlations.
- Structural descriptors: Features (67 features) computed from AlphaFold-predicted structures, including geometric descriptors, disorder, and graph-theoretical features. These are sparser, more heterogeneous, and possess weaker, less structured correlations.
Feature Selection Pipeline: A backward greedy elimination strategy was employed using the DII. The process iteratively removes the least informative feature (identified by the largest DII value) to generate a ranking of feature importance.
Statistical Physics Analysis: The DII value is treated as an order parameter, and the number of retained features ($F$) acts as a control parameter. The authors analyze the distribution of DII values, $P(\text{DII} \mid N, F)$, across random subsamples of varying sizes ($N$) to detect phase transitions. Key metrics include:
- Binder Cumulant ($U(F)$): Used to identify critical points and finite-size scaling effects.
- Finite-Size Scaling: Extrapolating the position of the Binder cumulant minimum ($F_{\min}$) to infinite sample size ($N \to \infty$) to define a critical feature number ($F_c$).
Mechanism Dissection: To understand the origins of observed transitions, the authors introduced a tunable model where feature correlations and variances were systematically perturbed using parameters $\beta$ (correlation strength) and $\alpha$ (variance homogenization).
Validation: The unsupervised critical point ($F_c$) was compared against the performance of a supervised binary classifier (Multilayer Perceptron) trained on the selected feature subsets.
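The backward greedy elimination step in the pipeline above can be sketched as follows. To keep the example short, it swaps in the simpler rank-based information imbalance (twice the mean reference-space rank of each point's nearest input-space neighbour, divided by N) in place of the differentiable version. The elimination criterion used here, dropping the feature whose removal leaves the lowest imbalance toward the full set, is an assumption about how "least informative" is scored, and all function names are hypothetical.

```python
import numpy as np


def _sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def reference_ranks(X_ref):
    """ranks[i, j] = rank of j among the neighbours of i in the reference space."""
    d = _sq_dists(X_ref)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)
    n = d.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    return ranks


def information_imbalance(X_input, ranks_ref):
    """Rank-based imbalance (stand-in for the DII): 2/N * mean reference rank
    of each point's nearest neighbour in the input space."""
    d_in = _sq_dists(X_input)
    np.fill_diagonal(d_in, np.inf)
    nearest = np.argmin(d_in, axis=1)
    n = len(nearest)
    return 2.0 / n * float(np.mean(ranks_ref[np.arange(n), nearest]))


def backward_greedy_elimination(X):
    """Remove one feature per step, always the one that is cheapest to drop.

    Returns the removal order (least informative first), the surviving
    feature, and the imbalance of the retained subset after each removal.
    """
    ranks_ref = reference_ranks(X)  # the full feature set is the reference
    remaining = list(range(X.shape[1]))
    removal_order, curve = [], []
    while len(remaining) > 1:
        scores = [
            information_imbalance(X[:, [g for g in remaining if g != f]], ranks_ref)
            for f in remaining
        ]
        best = int(np.argmin(scores))  # assumption: lowest imbalance after removal
        removal_order.append(remaining.pop(best))
        curve.append(scores[best])
    return removal_order, remaining, curve


# Synthetic data: features 0 and 1 dominate the geometry, 2-4 are tiny noise.
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(150, 2)), 0.01 * rng.normal(size=(150, 3))])
removal_order, remaining, curve = backward_greedy_elimination(X)
print("removed (least informative first):", removal_order)
print("last surviving feature:", remaining)
```

Reading the removal order backwards gives a feature ranking, and `curve` traces how the imbalance grows as the retained set shrinks, which is the quantity whose distribution the statistical physics analysis then examines.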

Key Results

Distinct Phase Transitions: The study reveals that the nature of the transition between low-information and high-information phases depends critically on the feature type.
- Physico-chemical features: Exhibit a sharp, glass-like phase transition. The DII distribution becomes bimodal at low feature counts, indicating a rugged landscape with competing minima (degeneracy of near-optimal solutions). The Binder cumulant shows a pronounced minimum that shifts with sample size, allowing for the definition of a critical feature number ($F_c \approx 12$ for LLPS).
- Structural features: Display a gradual crossover rather than a sharp phase transition. The DII distribution remains unimodal, and the Binder cumulant minimum is shallow and less dependent on sample size, suggesting that a critical feature number $F_c$ is not well defined.
Mechanisms of Criticality:
- For physico-chemical features, the transition is correlation-driven. The block structure of correlations creates frustration and multiple metastable states, analogous to lattice glass models. Suppressing or excessively amplifying these correlations eliminates the phase transition.
- For structural features, the transition is variance-driven. The heterogeneity in feature variances drives the crossover. When feature variances are homogenized, the crossover disappears, even in the absence of correlations.
Alignment with Supervised Performance: A significant finding is that for physico-chemical features, the critical number of features ($F_c$) identified purely through unsupervised DII analysis coincides with the saturation point of binary classification performance (AUROC). Beyond $F_c$, adding more features yields negligible improvement in AUROC. For structural features, classification performance increases smoothly without a clear saturation plateau corresponding to a critical point.
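The Binder cumulant analysis referred to in these results can be sketched in a few lines. The block below uses the standard fourth-order form U = 1 - <x^4> / (3 <x^2>^2), applied to the DII samples collected at each feature count, and locates the feature count where U is lowest. Whether the paper uses exactly this uncentered variant is an assumption, and the sample arrays are synthetic placeholders rather than data from the study.

```python
import numpy as np


def binder_cumulant(samples):
    """Fourth-order Binder cumulant U = 1 - <x^4> / (3 <x^2>^2)."""
    x = np.asarray(samples, dtype=float)
    m2 = np.mean(x ** 2)
    m4 = np.mean(x ** 4)
    return 1.0 - m4 / (3.0 * m2 ** 2)


def binder_curve(dii_samples_by_f):
    """Compute U(F) for every feature count F and return the minimiser F_min.

    `dii_samples_by_f` maps a feature count F to an array of DII values
    obtained from random subsamples of a fixed size N.
    """
    feature_counts = sorted(dii_samples_by_f)
    u = np.array([binder_cumulant(dii_samples_by_f[f]) for f in feature_counts])
    f_min = feature_counts[int(np.argmin(u))]
    return feature_counts, u, f_min


# Synthetic placeholder data: narrow unimodal DII distributions away from the
# transition and a two-peaked (bimodal) distribution at F = 10.
rng = np.random.default_rng(0)
synthetic = {
    5: rng.normal(0.80, 0.01, 500),
    10: np.concatenate([rng.normal(0.20, 0.01, 250), rng.normal(0.70, 0.01, 250)]),
    20: rng.normal(0.10, 0.01, 500),
}
feature_counts, u_values, f_min = binder_curve(synthetic)
for f, u in zip(feature_counts, u_values):
    print(f"F = {f:2d}  U = {u:.3f}")
print("F_min =", f_min)
```

A narrow single-peaked distribution gives U close to 2/3, while two well-separated peaks pull U down, which is why a dip in U(F) is read as a signature of coexisting states. Repeating the calculation for several subsample sizes N and tracking how the minimiser moves is the input to the finite-size extrapolation described in the methodology.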

Significance and Claims
The paper establishes a direct link between the statistical properties of feature spaces, criticality, and generalization in protein classification. The authors claim that:

Unsupervised feature selection can be rigorously interpreted through the lens of statistical physics, specifically the theory of disordered systems and glass transitions.
The Differentiable Information Imbalance serves as a natural order parameter that reveals distinct mechanisms of criticality: correlation-driven glass-like transitions for physico-chemical descriptors and variance-driven crossovers for structural descriptors.
The critical point identified in the unsupervised regime ($F_c$) provides a principled, label-free criterion for determining the minimal feature set required for optimal predictive performance. This suggests that the geometry of the feature space alone encodes the limits of generalization.
These results offer a theoretical foundation for understanding feature selection in high-dimensional data, suggesting that informative features act as interacting degrees of freedom subject to competing constraints, with generalization emerging at the edge of a glassy phase.

The work does not propose new experimental protocols but rather provides a theoretical characterization of existing feature selection pipelines, opening the door for future applications of replica symmetry breaking and cavity-based approaches in data analysis.
