Improved inference of multiscale sequence statistics in generative protein models

This paper introduces the stochastic Boltzmann Machine (sBM), a novel regularization strategy that more accurately captures multiscale statistical correlations in protein sequences, thereby enabling generative models to produce functional and diverse proteins without requiring post hoc corrections.

Chauveau, M., Kleeorin, Y., Hinds, E., Junier, I., Ranganathan, R., Rivoire, O.

Published 2026-04-09

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: Teaching a Robot to Cook Like a Master Chef

Imagine you want to teach a robot to cook a perfect steak. You give it a cookbook with 300 recipes (the data). However, the cookbook is incomplete, the pages are smudged, and some recipes are just copies of others.

The robot's goal is to learn the "rules" of cooking so well that it can invent new recipes that:

  1. Taste good (Functional/Fidelity).
  2. Are different from the ones in the book (Novelty).
  3. Cover a wide variety of flavors, not just one type (Diversity).

This is exactly what scientists are trying to do with proteins. Proteins are the "machines" of life, made of chains of amino acids. Scientists have a "cookbook" of natural proteins (a Multiple Sequence Alignment), but it's messy and incomplete. They want to build a computer model that can invent new, working proteins.
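To make the "cookbook" concrete, here is a minimal sketch (my illustration, not code from the paper; the toy alignment and function names are invented) of the statistics a generative model tries to reproduce from a multiple sequence alignment: how often each amino acid appears at each position, and how often pairs of amino acids co-occur at pairs of positions.

```python
import numpy as np

# Toy multiple sequence alignment: 3 sequences, 4 aligned positions.
msa = np.array([
    list("ACDA"),
    list("ACEA"),
    list("GCDA"),
])

def single_freqs(msa, pos):
    """Frequency of each amino acid at one position (the f_i statistics)."""
    vals, counts = np.unique(msa[:, pos], return_counts=True)
    return dict(zip(vals, counts / len(msa)))

def pair_freqs(msa, i, j):
    """Joint frequency of amino-acid pairs at positions i and j (f_ij)."""
    pairs, counts = np.unique(
        np.char.add(msa[:, i], msa[:, j]), return_counts=True)
    return dict(zip(pairs, counts / len(msa)))

# 'A' appears in 2 of the 3 sequences at position 0
freqs_pos0 = single_freqs(msa, 0)
# co-occurrence of positions 0 and 2 captures their correlation
freqs_02 = pair_freqs(msa, 0, 2)
```

A generative model is trained so that sequences sampled from it reproduce these single-position and pairwise frequencies; correlations in the pairwise statistics are where both the "loud" sector-wide patterns and the "quiet" local contacts live.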

The Problem: The "One-Size-Fits-All" Filter

For a long time, scientists used a standard method (called Boltzmann Machine or BM) to teach the computer. To stop the computer from memorizing the messy cookbook too perfectly (a problem called overfitting), they used a "filter" called Regularization.

Think of this filter like a sieve or a pair of noise-canceling headphones.

  • The Issue: The natural world has two types of "noise" or patterns:
    1. Big, loud patterns: Groups of ingredients that must change together to make the dish work (like salt and pepper). In proteins, these are called sectors.
    2. Small, quiet patterns: Specific pairs of ingredients that must sit right next to each other to hold the dish together. In proteins, these are local contacts (pairs of amino acids that touch in the folded structure).

The old "sieve" (standard regularization) was too blunt. It treated the loud patterns and the quiet patterns exactly the same.

  • If the sieve was too coarse, it let in too much noise, and the robot made broken, non-functional proteins.
  • If the sieve was too fine, it blocked the quiet patterns, and the robot only made proteins that looked exactly like the old ones (no novelty) or lost the ability to hold their shape.

To fix this, scientists used a "band-aid" solution: they turned up the volume on the loud patterns after the fact. But this made the proteins less diverse. It was like forcing the robot to only cook steaks because it was scared to try anything else.
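The "one-size-fits-all sieve" can be sketched numerically (my toy illustration, not the paper's code; in a real Boltzmann machine the model statistics would be re-estimated from the model at every step, and are held fixed here only to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(0)
f_data = rng.random((5, 5))   # pairwise statistics measured in the data
f_model = rng.random((5, 5))  # same statistics under the model (held fixed
                              # here only for brevity)

J = np.zeros((5, 5))          # coupling parameters being learned
lam, lr = 0.1, 0.5            # ONE global regularization strength

for _ in range(300):
    # standard Boltzmann-machine gradient plus uniform L2 shrinkage:
    # the same lambda is applied to every coupling, loud or quiet alike
    J += lr * ((f_data - f_model) - lam * J)

# the update converges to J = (f_data - f_model) / lam: every correlation,
# big "sector" or small contact, is damped by the identical factor
```

The single dial `lam` is the whole problem: loosen it and noise floods in, tighten it and the quiet contact patterns are crushed along with the noise.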

The Solution: The "Smart Stochastic" Filter (sBM)

The authors of this paper introduced a new method called the Stochastic Boltzmann Machine (sBM).

Instead of using a blunt, one-size-fits-all sieve, the sBM is like a smart, adaptive chef. It learns to treat the "loud" and "quiet" patterns differently, right while it is learning.

How does it work? (The Three Magic Tricks)
The sBM uses three tricks to learn without getting confused:

  1. Stopping Early (Early Stopping): Imagine you are learning a song. If you practice for 10 hours, you might memorize the specific mistakes the teacher made. If you stop after 2 hours, you've learned the essence of the song without the noise. The sBM stops learning before it gets too obsessed with the tiny details of the specific 300 recipes.
  2. Feeling the Terrain (Curvature): When walking down a hill, sometimes the ground is steep (a small step changes your position a lot), and sometimes it's flat (you can walk a long way without changing much). The sBM "feels" the ground. It knows that some protein rules are steep and critical, while others are flat and flexible. It adjusts its steps accordingly.
  3. Sampling with a Grain of Salt (Limited Chains): Instead of trying to count every single grain of sand on the beach to understand the beach, the sBM takes a few handfuls of sand, looks at them, and makes a guess. It intentionally uses a "small sample size" during its learning process. This forces the model to be humble and not over-calculate, which naturally prevents it from getting confused by the noise.
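The three tricks above can be caricatured in a short training loop (a loose sketch under my own assumptions, with hypothetical names; it stands in an AdaGrad-style scaling for the paper's curvature correction and Gaussian sampling for the model's actual chains):

```python
import numpy as np

rng = np.random.default_rng(1)

def model_stats(J, n_chains=32):
    """(3) Estimate the model's statistics from a deliberately SMALL number
    of sampling chains; the sampling noise acts as implicit regularization."""
    samples = rng.normal(loc=J, scale=1.0, size=(n_chains,) + J.shape)
    return samples.mean(axis=0)

def train_sbm_sketch(f_data, f_heldout, lr=0.5, max_steps=500, patience=30):
    J = np.zeros_like(f_data)           # parameters being learned
    g2 = np.zeros_like(f_data)          # accumulated squared gradients
    best_err, best_J, stalls = np.inf, J.copy(), 0
    for _ in range(max_steps):
        grad = f_data - model_stats(J)  # mismatch of data vs. model stats
        g2 += grad ** 2                 # (2) crude curvature proxy: take
        J += lr * grad / (np.sqrt(g2) + 1e-8)  # smaller steps in directions
                                        # that have seen large gradients
        err = np.mean((f_heldout - J) ** 2)    # fit to held-out data
        if err < best_err - 1e-6:
            best_err, best_J, stalls = err, J.copy(), 0
        else:
            stalls += 1
            if stalls >= patience:      # (1) stop before memorizing noise
                break
    return best_J
```

The point of the caricature is the interplay: the small number of chains keeps the gradient noisy, the curvature scaling keeps that noise from derailing sensitive directions, and early stopping quits while the model still fits unseen data rather than the quirks of the training set.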

The Results: A Better Robot Chef

When they tested this new method on a specific enzyme (Chorismate Mutase, which helps make amino acids), the results were amazing:

  • Old Method (BM): To get the robot to make working proteins, they had to force it to be very strict. The result? The proteins worked, but they were all clones of each other. They had no creativity.
  • New Method (sBM): The robot made proteins that worked just as well as the old ones, BUT they were also:
    • More diverse: They looked like a whole new family of proteins.
    • More novel: They were distinct from the original cookbook.
    • No "Band-Aids": It didn't need any post-processing tricks to fix the mistakes.

The Takeaway

This paper is a breakthrough because it shows that when we try to model complex, messy biological systems, we shouldn't use blunt tools that treat everything the same.

By using a method that respects the different scales of the data (the big groups vs. the small contacts), we can build AI models that don't just copy nature, but can invent new, functional, and diverse biological machines. It's the difference between a photocopier (the old way) and a creative artist (the new way).
