ProtNHF: Neural Hamiltonian Flows for Controllable… — Plain-Language Explanation

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master chef trying to invent a new recipe. You have a massive cookbook (a database of all known proteins) and a powerful AI assistant that can taste millions of dishes and learn the "rules" of what makes a dish delicious and safe to eat.

Most current AI chefs work like this: If you want a spicy dish, you have to teach the AI to cook spicy food from scratch. If you want a sweet dish, you have to retrain it again. If you want a dish with exactly 5% less salt, you have to start over. It's slow, expensive, and rigid.

ProtNHF is a new kind of AI chef that works differently. Instead of retraining the chef every time you change your mind, it gives the chef a set of adjustable dials that you can tweak while the dish is being cooked.

Here is how it works, broken down into simple concepts:

1. The "Hamiltonian" Kitchen (The Physics of Cooking)

The authors use a concept from physics called Hamiltonian Dynamics. Think of this as a perfectly balanced kitchen where energy is never lost.

The Potential Energy (The Recipe): The AI has learned a "recipe" for what a good protein looks like. It knows which ingredients (amino acids) usually go together.
The Kinetic Energy (The Momentum): Imagine the cooking process has momentum. Once the AI starts mixing ingredients, it keeps moving forward smoothly, like a skater gliding on ice.
The Result: Because the physics are so balanced, the AI can generate thousands of new, valid recipes (protein sequences) very quickly without getting stuck or making nonsense.

2. The Magic of "Inference-Time" Control

This is the paper's big breakthrough. Usually, to change a recipe, you have to rewrite the cookbook. ProtNHF doesn't need that.

Imagine the AI is generating a protein sequence. While the dish is still being cooked, you can add a bias (a gentle nudge) that steers where it ends up.

The Analogy: Imagine the AI is driving a car down a highway (generating a random protein). You don't need to rebuild the car or change the driver's training. You just gently turn the steering wheel or press the gas pedal while the car is moving.
The Tools: The paper introduces three types of "steering wheels":
- Coulomb Bias (The Repeller): Like a magnet that pushes away specific ingredients. If you want fewer "Lysine" ingredients, you turn up the magnet, and the AI naturally steers away from them.
- Gaussian Bias (The Attractor): Like a magnet that pulls specific ingredients closer. If you want more "Aspartic Acid," you turn up the pull, and the AI adds more of it.
- Harmonic Bias (The Anchor): Like a leash that forces a specific ingredient to stay in a specific spot (e.g., "The first ingredient must be Methionine").

3. Why This is a Big Deal

In the past, if a scientist wanted a protein that was slightly more acidic or had a specific charge, they had to:

Take a huge model.
Retrain it for weeks on a supercomputer.
Hope it worked.

With ProtNHF, they just:

Take the pre-trained model.
Turn a dial (a simple number) to say, "Make it slightly more acidic."
Get the result without any retraining.

4. Does it actually work?

The authors tested this by generating thousands of artificial protein sequences.

Quality: The proteins looked real. When they used a tool called AlphaFold (which predicts what a protein looks like in 3D), the AI's creations folded into stable, sensible shapes, just like real biological proteins.
Control: When they turned the "acidic" dial, the proteins actually became more acidic. When they turned the "positive charge" dial, the proteins became more positive. The control was smooth and predictable, like turning a volume knob rather than flipping a light switch.

The Bottom Line

ProtNHF is like giving a generative AI a "remote control" for its output. Instead of building a new robot for every new task, you build one smart robot and give it a remote that lets you steer its behavior in real-time.

This could make protein engineering considerably more flexible. Scientists could design custom proteins for medicine, enzymes for cleaning up pollution, or new materials by "tuning" the AI rather than retraining it for each new goal. It makes steering a protein-design model feel more like adjusting the settings on a thermostat.

1. Problem Statement

Controllable protein sequence generation is a critical challenge in computational protein design. While existing generative models (diffusion, autoregressive, flow-based) can produce diverse and structurally plausible proteins, they struggle with quantitative, continuous control over global sequence properties (e.g., amino acid composition, net charge, solubility) without significant computational overhead.

Current methods typically rely on:

Retraining/Fine-tuning: Modifying the model for every new target property.
Architectural Changes: Adding specific conditioning tags or auxiliary predictors.
Classifier Guidance: Using external predictors during inference, which can be unstable or require retraining.

These approaches lack flexibility and are computationally expensive. There is a need for a framework that allows inference-time programmability where desired properties are enforced by explicitly shaping the energy landscape, analogous to classical molecular modeling, without retraining the base model.

2. Methodology: ProtNHF

The authors introduce ProtNHF, a generative model based on Neural Hamiltonian Flows (NHF). The core innovation is treating protein sequence generation as a dynamical system in a continuous phase space, allowing for control via analytical bias potentials.

A. Continuous Relaxation of Sequence Space

Since NHFs operate on continuous variables but protein sequences are discrete (20 amino acids), the authors employ an argmax flow technique to map between discrete sequences ($x$) and continuous embeddings ($q$). Because argmax is many-to-one, the encoding direction is stochastic while the decoding direction is deterministic:

Encoding: A discrete one-hot vector $x$ is mapped to a continuous embedding $q$ by using a neural network to generate Gaussian noise $u$, which is then thresholded so that the argmax of $q$ matches $x$.
Reversibility: Encoding followed by $x = \text{argmax}(q)$ recovers the original sequence exactly, allowing the model to sample continuous vectors and deterministically convert them back to valid amino acid sequences.
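
The round trip between sequences and embeddings can be illustrated with a minimal sketch. It assumes the thresholding used in the original argmax-flow formulation (every non-target entry is pushed below the target entry with a softplus), and it replaces the learned noise network with plain standard-normal noise. The names `encode`, `decode`, and `AMINO_ACIDS` are hypothetical and not taken from the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one column each


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def encode(seq, rng):
    """Map a sequence to embeddings q whose row-wise argmax spells seq."""
    idx = np.array([AMINO_ACIDS.index(a) for a in seq])
    rows = np.arange(len(seq))
    # Stand-in for the learned noise network: standard-normal noise u.
    u = rng.standard_normal((len(seq), len(AMINO_ACIDS)))
    # Threshold T is the noise value at the true residue of each position.
    T = u[rows, idx][:, None]
    # Push every entry strictly below T; softplus(.) > 0 guarantees this.
    q = T - softplus(T - u)
    # The true residue keeps its own value, so it becomes the row maximum.
    q[rows, idx] = T[:, 0]
    return q


def decode(q):
    """Deterministic direction: x = argmax(q) at every position."""
    return "".join(AMINO_ACIDS[i] for i in q.argmax(axis=1))


rng = np.random.default_rng(0)
q = encode("MKTAYIAK", rng)
print(decode(q))  # MKTAYIAK
```

Different random draws give different embeddings for the same sequence, but all of them decode to that sequence, which is the many-to-one behaviour of argmax.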

B. Hamiltonian Dynamics Architecture

The model learns a symplectic transport map from a latent Gaussian distribution ( $\pi_0$ ) to the target protein sequence distribution ( $\pi_T$ ).

Hamiltonian Definition: $H(q, p) = V_\theta(q) + K(p)$ , where $V_\theta$ is the potential energy (learned) and $K(p)$ is the kinetic energy (fixed).
Potential Energy ( $V_\theta$ ): Parameterized by a lightweight Transformer architecture (inspired by ESM-2, ~8M parameters) using Performer attention (linear attention) to handle sequence length efficiently. The transformer outputs an "energy" scalar for the protein.
Dynamics: The system evolves using Leapfrog integration (4 steps, $\Delta t = 0.05$).
- Training: Runs forward from target data to latent space.
- Sampling: Runs backward from latent Gaussian noise to generate sequences.
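
As a rough illustration of why the same integrator can serve both training and sampling, the sketch below runs a leapfrog scheme with the step count and step size quoted above and then undoes it by flipping the momentum. The quadratic potential is a toy stand-in for the learned $V_\theta$, and $K(p) = \frac{1}{2}\|p\|^2$ is an assumption, since the summary only says the kinetic energy is fixed.

```python
import numpy as np


def grad_V(q):
    # Toy stand-in for the gradient of the learned potential V_theta:
    # V(q) = 0.5 * ||q||^2, so grad V = q.
    return q


def leapfrog(q, p, grad_potential, n_steps=4, dt=0.05):
    """Integrate dq/dt = p, dp/dt = -grad V(q), assuming K(p) = 0.5 * ||p||^2."""
    q, p = q.copy(), p.copy()
    for _ in range(n_steps):
        p = p - 0.5 * dt * grad_potential(q)  # half step in momentum
        q = q + dt * p                        # full step in position
        p = p - 0.5 * dt * grad_potential(q)  # half step in momentum
    return q, p


rng = np.random.default_rng(0)
q0 = rng.standard_normal((8, 20))  # 8 positions x 20 residue channels
p0 = rng.standard_normal((8, 20))

q1, p1 = leapfrog(q0, p0, grad_V)   # one direction
q2, p2 = leapfrog(q1, -p1, grad_V)  # flip the momentum and integrate again

# The second pass retraces the first, up to floating-point error.
print(np.allclose(q2, q0), np.allclose(-p2, p0))  # True True
```

Because each leapfrog step is exactly reversible in this sense, a map learned in one direction can be run in the other direction without a separate inverse network.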

C. Inference-Time Controllability (Energy Shaping)

The key mechanism for control is the additive nature of the Hamiltonian. During the sampling (reverse) process, an external bias potential $U(q)$ is added to the learned Hamiltonian without modifying the trained weights:
$H_b(q, p) = H(q, p) + k \cdot U(q)$
Where $k$ is a scalar bias strength. The authors implement three types of analytical bias potentials:

Coulomb Bias: $U(q) = \sum \frac{1}{\sqrt{\|q_i - r\| + \epsilon}}$ . Used to repel or attract specific residue embeddings (e.g., reducing Lysine content).
Gaussian Bias: $U(q) = \sum \exp(-\frac{\|q_i - r\|^2}{2\sigma})$ . Used to enrich or deplete specific residues monotonically.
Harmonic Bias:
- Positional: Forces a specific residue at a specific position (e.g., N-terminal Methionine).
- Global: Enforces global properties (e.g., net charge) by defining $U(q) = \frac{1}{2}[F(q) - F^*]^2$ , where $F(q)$ is a differentiable function of the embeddings (e.g., calculating net charge via softmax probabilities).
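
The bias terms listed above are simple enough to write down directly. The following sketch implements them as stated in this summary and takes one small step along the resulting bias force $-k \nabla U(q)$. Several details are assumptions made for illustration: the reference point $r$ is taken to be a one-hot corner, the charge table treats only K, R as positive and D, E as negative, and finite differences stand in for automatic differentiation.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed side-chain charges near neutral pH (histidine treated as neutral).
CHARGE = np.array([{"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}.get(a, 0.0)
                   for a in AMINO_ACIDS])


def reference(residue):
    # Hypothetical reference point r: the one-hot corner for that residue.
    return np.eye(len(AMINO_ACIDS))[AMINO_ACIDS.index(residue)]


def coulomb_bias(q, r, eps=1e-3):
    # U(q) = sum_i 1 / sqrt(||q_i - r|| + eps), as written in the text.
    d = np.linalg.norm(q - r, axis=1)
    return float(np.sum(1.0 / np.sqrt(d + eps)))


def gaussian_bias(q, r, sigma=1.0):
    # U(q) = sum_i exp(-||q_i - r||^2 / (2 sigma)), as written in the text.
    d2 = np.sum((q - r) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2.0 * sigma))))


def soft_net_charge(q):
    # F(q): expected net charge under per-position softmax probabilities.
    z = np.exp(q - q.max(axis=1, keepdims=True))
    probs = z / z.sum(axis=1, keepdims=True)
    return float(np.sum(probs @ CHARGE))


def harmonic_charge_bias(q, target):
    # U(q) = 0.5 * (F(q) - F*)^2
    return 0.5 * (soft_net_charge(q) - target) ** 2


def numerical_grad(U, q, h=1e-5):
    # Central finite differences; an autodiff library would replace this.
    g = np.zeros_like(q)
    for i in np.ndindex(q.shape):
        dq = np.zeros_like(q)
        dq[i] = h
        g[i] = (U(q + dq) - U(q - dq)) / (2.0 * h)
    return g


rng = np.random.default_rng(0)
q = rng.standard_normal((6, 20))
k = 2.0  # bias strength

def U(x):
    return harmonic_charge_bias(x, target=-1.0)

bias_force = -k * numerical_grad(U, q)  # extra force added to -grad V_theta
q_nudged = q + 0.05 * bias_force        # one small step along the bias force
print(U(q), U(q_nudged))                # the second value is smaller
```

In the full model this force would be added to the learned force inside each integration step rather than applied on its own; the isolated step here only shows that the nudge moves the soft net charge toward the target.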

3. Key Contributions

First Application of NHFs to Proteins: Demonstrates that Neural Hamiltonian Flows can effectively model complex protein sequence distributions.
Retraining-Free Control: Establishes a framework where global sequence properties (composition, charge, position) are controlled purely by adjusting analytical bias terms at inference time.
Physically Interpretable Conditioning: Casts controllable generation in a classical molecular modeling paradigm, where "energy shaping" dictates the output distribution, preserving the symplectic structure and invertibility of the flow.
Differentiable Global Properties: Introduces a method to compute differentiable estimates of global properties (like net charge) directly from continuous embeddings, enabling gradient-based control within the Hamiltonian flow.

4. Experimental Results

The model was trained on ~90,000 UniProtKB sequences (lengths 10–128).

Unconditional Generation

Quality: For short sequences (length 20), the model achieves competitive ESM-2 pseudo-perplexity (pppl) and high AlphaFold2 pLDDT scores (90–100), indicating high structural confidence.
Length Scaling: As sequence length increases (up to 50), ESM-2 pppl degrades (approaching noise levels), but structural confidence (pLDDT) remains high (75–80). This suggests the model captures structural coherence even when sequence likelihood decreases.
Diversity: Generated sequences contain few low-complexity regions (LCRs) at longer lengths, avoiding the "mode collapse" seen in other generative models.
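
For readers unfamiliar with pseudo-perplexity, the sketch below uses the standard masked-language-model definition: mask each position in turn, record the log-probability the model assigns to the true residue, and exponentiate the negative mean. The exact scoring protocol used in the paper is not given in this summary, and the log-probabilities here are made-up values rather than real ESM-2 output.

```python
import math


def pseudo_perplexity(masked_log_probs):
    """exp(-mean log p(x_i | rest)), one log-probability per masked position."""
    if not masked_log_probs:
        raise ValueError("need at least one position")
    return math.exp(-sum(masked_log_probs) / len(masked_log_probs))


# Hypothetical per-position log-probabilities, not real model output.
uniform = [math.log(1 / 20)] * 10  # model guesses uniformly over 20 residues
confident = [math.log(0.9)] * 10   # model is confident at every position

print(pseudo_perplexity(uniform))    # close to 20
print(pseudo_perplexity(confident))  # close to 1.11
```

A uniform guess over the 20 residues gives a value of 20, which offers a rough sense of scale for what an uninformative score looks like; lower values mean the language model finds the sequence more plausible.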

Conditional Generation

Residue Composition:
- Coulomb Bias: Successfully reduced Lysine content continuously as bias strength $k$ increased, with minimal impact on ESM-2 pppl.
- Gaussian Bias: Successfully increased Aspartic Acid content monotonically.
Positional Control: Enforcing an N-terminal Methionine (Harmonic bias) improved ESM-2 pppl and increased secondary structure diversity compared to the unbiased case.
Global Property Control:
- Net Charge: By applying a harmonic bias on the differentiable net charge function, the model generated sequences with target net charges (e.g., 0 or -1) with high precision.
- Structural Integrity: Conditioned sequences maintained or slightly improved AlphaFold pLDDT scores compared to unconditional sequences of the same length, suggesting that biasing does not compromise structural plausibility.
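
Checks of this kind come down to counting on the decoded sequences. The sketch below shows one way to measure residue fraction, integer net charge, and N-terminal methionine frequency over a batch of samples. The charge convention (K and R positive, D and E negative, everything else neutral) is an assumption, and the sample sequences are made up for illustration rather than taken from the paper.

```python
from collections import Counter

POSITIVE, NEGATIVE = set("KR"), set("DE")  # assumed charge convention


def residue_fraction(seq, residue):
    # Share of positions occupied by the given residue.
    return Counter(seq)[residue] / len(seq)


def net_charge(seq):
    # Count of positive residues minus count of negative residues.
    return sum(a in POSITIVE for a in seq) - sum(a in NEGATIVE for a in seq)


def summarize(seqs, residue="K"):
    n = len(seqs)
    return {
        "mean_fraction": sum(residue_fraction(s, residue) for s in seqs) / n,
        "mean_net_charge": sum(net_charge(s) for s in seqs) / n,
        "starts_with_M": sum(s.startswith("M") for s in seqs) / n,
    }


samples = ["MKKDELLA", "MADEEKLG", "GKRKDAAL"]  # made-up sequences
print(summarize(samples))
```

Running `summarize` on batches generated at increasing bias strength $k$ would show how a statistic such as lysine fraction responds to the dial.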

5. Significance and Conclusion

ProtNHF offers a different route to controllable protein design. By leveraging the additive energy structure of Hamiltonian dynamics, it decouples the learning of the base protein distribution from the imposition of constraints.

Efficiency: Eliminates the need for retraining or architectural modifications for new design goals.
Interpretability: Uses transparent, physically motivated energy terms (Coulomb, Harmonic) to steer generation, making the control mechanism intuitive for protein engineers.
Future Impact: This approach bridges generative AI with classical molecular dynamics, opening avenues for incorporating complex physics-inspired constraints (electrostatics, structural priors) directly into the generative process. It provides a flexible foundation for designing artificial proteins and functional biomolecules with precise compositional and physicochemical specifications.

ProtNHF: Neural Hamiltonian Flows for Controllable Protein Sequence Generation