Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles

Imagine you are trying to teach a robot to design brand-new proteins. Proteins are like complex, 3D origami made of amino acids, and their shape determines what they do in your body (like fighting viruses or building muscles).

Currently, AI robots are getting pretty good at this, but they have three big problems:

They try to learn geometry and design at the same time, which is like trying to learn how to drive a car while simultaneously learning how to build the engine. It's too much to handle.
They focus too much on tiny details (like individual atoms) and miss the big picture of how the whole shape folds.
They think proteins are static statues. In reality, proteins wiggle, dance, and change shape to do their jobs. The AI doesn't understand this movement.

This paper introduces a new training method called RigidSSL (Rigidity-Aware Self-Supervised Learning) to fix these issues. Think of it as a two-step "boot camp" for the AI before it tries to design anything.

The Core Idea: Treat Proteins Like Rigid Blocks

Instead of treating a protein as a pile of loose atoms, RigidSSL treats each piece of the protein (called a "residue") as a rigid block (like a Lego brick). You can move the whole block or rotate it, but you don't bend the block itself. This simplifies the math and helps the AI understand the "skeleton" of the protein better.

The Two-Step Boot Camp

Phase 1: The "Shake-Up" Training (RigidSSL-Perturb)

The Setup: The AI is shown 432,000 static protein structures from a massive database (like a library of frozen statues).
The Trick: The AI takes a perfect protein and simulates shaking it up. It adds random noise to the position and rotation of every Lego block, creating a "messy" version.
The Lesson: The AI's job is to look at the messy version and figure out how to push the blocks back into their original, perfect shape.
The Result: This teaches the AI the fundamental rules of protein geometry. It learns what a stable, foldable protein looks like. It's like learning the rules of balance by trying to stack blocks that keep falling over.
Outcome: This version of the AI is the stronger of the two at designing stable, reliable proteins that hold their shape.

Phase 2: The "Dance Class" Training (RigidSSL-MD)

The Setup: The AI is now shown 1,300 videos of proteins moving (called Molecular Dynamics trajectories). These aren't frozen statues; they are proteins wobbling, stretching, and shifting as they would in real life.
The Trick: The AI watches a protein move from one frame to the next and learns the physics of that movement.
The Lesson: This teaches the AI that proteins are dynamic. It learns that a protein isn't just one shape; it's a cloud of possible shapes.
The Result: This version of the AI becomes great at creating diverse and realistic proteins that mimic how nature actually works. It's like learning to dance instead of just standing still.

Why This Matters (The Real-World Wins)

The paper tested this new "boot camp" on three main tasks:

Designing New Proteins (The "Unconditional" Task):
- The AI was asked to just "make a new protein."
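- Result: The Phase 1 trained AI raised designability (how often a generated shape can be matched with an amino-acid sequence that folds into it) by up to 43% compared with the same models without the boot camp. It also produced ultra-long proteins (700 to 800 blocks) with good structural quality, a task where models without the boot camp failed.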
Fitting a Key into a Lock (Motif Scaffolding):
- Imagine you have a specific key (a functional part of a protein) and need to build a handle (the scaffold) around it.
- Result: The Phase 1 AI succeeded about 5.8 percentage points more often (15.19% vs. 9.35% of attempts) at building a working handle, without being explicitly taught how to do it for that specific key. Its geometric knowledge carried over to a task it had not been trained on.
Modeling Complex Machines (GPCRs):
- GPCRs are complex protein machines in our cells that act like switches. They are notoriously hard to model because they wiggle a lot.
- Result: Both trained versions beat the untrained baselines, each in a different way. The Phase 1 AI best predicted how much each part of the machine wiggles. The Phase 2 AI (the "Dance Class" version) best captured subtler behavior, such as fleeting contacts between parts and buried pieces that briefly become exposed.

The Big Picture Analogy

Think of previous AI models as apprentices who tried to build a house by looking at a pile of bricks and guessing the blueprint. They often built houses that looked okay but collapsed when the wind blew.

RigidSSL is like sending those apprentices to a two-part school:

First, they learn the laws of physics and structural integrity by trying to rebuild a house after a storm (Phase 1). They learn what makes a house stand up.
Second, they watch videos of houses settling into the ground and swaying in the wind (Phase 2). They learn that a house isn't a rigid statue; it breathes and moves slightly.

By the time they graduate, they can design houses that are not only structurally sound but also realistic and adaptable to the environment. The paper's results suggest that teaching AI these specific "physics lessons" first leads to better protein designs.

1. Problem Statement

The paper addresses three critical limitations in current generative models for de novo protein design:

Coupling of Geometry and Generation: Existing end-to-end frameworks attempt to learn fundamental protein geometry and complex generation mechanisms simultaneously. This tight coupling leads to inefficient optimization and poor generalization to novel or out-of-distribution design tasks.
Inadequate Representations: Most pretraining methods rely on local, non-rigid atomic or fragment-level representations. While sufficient for property prediction, these fail to capture global folding geometry, limiting the transferability of learned representations to generative design tasks.
Lack of Dynamic Data: Current pretraining datasets (e.g., PDB, AFDB) are dominated by static snapshots. Models trained solely on these fail to capture intrinsic conformational flexibility, near-native fluctuations, or transitions between metastable states, which are crucial for modeling dynamic biological processes.

2. Methodology: RigidSSL

The authors propose RigidSSL (Rigidity-Aware Self-Supervised Learning), a two-stage geometric pretraining framework designed to learn transferable geometric priors before generative finetuning.

Core Representation

Instead of modeling individual atoms, RigidSSL treats each amino acid residue as a rigid body. A protein structure is represented as a sequence of rigid transformations in the special Euclidean group $SE(3)$:

Translation ($\vec{t} \in \mathbb{R}^3$): Position of the $C_\alpha$ atom.
Rotation ($r \in SO(3)$): Orientation of the residue frame.
Canonicalization: Before processing, structures are aligned to a canonical inertial reference frame (center of mass and principal axes) to ensure translation and rotation interpolation paths are consistent and invariant to global pose.
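
The canonicalization step can be illustrated with a minimal NumPy sketch. It assumes equal residue masses (so the center of mass is the mean $C_\alpha$ position), takes the principal axes from the covariance of the centered coordinates, and assumes each residue rotation maps local to global coordinates. The function name `canonicalize` and the array layout are illustrative choices, not taken from the paper, and the sketch does not resolve the remaining sign ambiguity of the axes.

```python
import numpy as np

def canonicalize(ca_xyz, rotations):
    """Express residue frames in a center-of-mass, principal-axes frame.

    ca_xyz:    (N, 3) C-alpha positions (the translation part of each frame).
    rotations: (N, 3, 3) residue rotation matrices (local -> global).
    Assumption: all residues carry equal mass.
    """
    # Shift the center of mass to the origin.
    centered = ca_xyz - ca_xyz.mean(axis=0)

    # Principal axes are the eigenvectors of the 3x3 coordinate covariance.
    cov = centered.T @ centered / len(centered)
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    axes = eigvecs[:, ::-1].copy()            # largest-variance axis first

    # Keep the axes right-handed so they form a proper rotation.
    if np.linalg.det(axes) < 0:
        axes[:, -1] *= -1

    new_t = centered @ axes                   # positions in the principal frame
    new_r = axes.T @ rotations                # same global rotation applied to each frame
    return new_t, new_r
```

After this step the translations have zero mean and a diagonal covariance, so two copies of the same structure in different global poses map to the same coordinates up to axis sign flips.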

Two-Phase Pretraining Strategy

RigidSSL employs a sequential pretraining approach using a bi-directional, rigidity-aware flow matching objective.

Phase I: RigidSSL-Perturb (Static Geometry Learning)

Data: 432K structures from the AlphaFold Protein Structure Database (AFDB).
Mechanism: Simulates perturbations on static structures to create a "noisy" view ($g_1$) from a clean view ($g_0$).
- Translation: Adds Gaussian noise in $\mathbb{R}^3$.
- Rotation: Samples from an isotropic Gaussian distribution on $SO(3)$ (IGSO(3)) so that the rotational noise respects the manifold and stays physically plausible.
Goal: Learn robust geometric priors and global structural regularities by maximizing mutual information between the clean and perturbed views.
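
A rough sketch of how such a perturbed view could be generated is shown below. It is not the authors' implementation. It assumes one common parameterization of the IGSO(3) angle density (a truncated series, as used in SE(3) diffusion models), samples the rotation angle by numerical inverse-CDF on a grid, draws a uniform rotation axis, and applies the noise by left-multiplication. The function names, the noise scales `sigma_t` and `sigma_r`, and the truncation `l_max` are all hypothetical; the truncation is only adequate for moderately large `sigma_r` (roughly 0.05 and above).

```python
import numpy as np
from scipy.spatial.transform import Rotation

def igso3_angle_pdf(omega, sigma, l_max=200):
    """Density of the rotation angle under IGSO(3), truncated at l_max terms."""
    l = np.arange(l_max)[:, None]
    series = ((2 * l + 1) * np.exp(-l * (l + 1) * sigma**2 / 2)
              * np.sin((l + 0.5) * omega) / np.sin(omega / 2)).sum(axis=0)
    return (1 - np.cos(omega)) / np.pi * series

def sample_igso3(n, sigma, rng, grid=2048):
    """Draw n rotations: angle by inverse-CDF on a grid, axis uniform on the sphere."""
    omega = np.linspace(1e-4, np.pi, grid)
    pdf = np.clip(igso3_angle_pdf(omega, sigma), 0.0, None)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    angles = np.interp(rng.random(n), cdf, omega)
    axes = rng.normal(size=(n, 3))
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    return Rotation.from_rotvec(axes * angles[:, None])

def perturb_frames(t, R, sigma_t, sigma_r, rng):
    """Create a noisy view g1 from a clean view g0 = (t, R).

    t: (N, 3) translations, R: (N, 3, 3) rotation matrices.
    """
    noisy_t = t + sigma_t * rng.normal(size=t.shape)       # Gaussian noise in R^3
    noise = sample_igso3(len(t), sigma_r, rng)             # IGSO(3) noise on SO(3)
    noisy_R = (noise * Rotation.from_matrix(R)).as_matrix()
    return noisy_t, noisy_R
```

Because the rotational noise is composed with the clean rotation rather than added to matrix entries, every perturbed frame remains a valid rotation matrix.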

Phase II: RigidSSL-MD (Dynamic Flexibility Learning)

Data: 1.3K molecular dynamics (MD) trajectories from the ATLAS dataset.
Mechanism: Constructs paired views ($g_0, g_1$) by sampling two frames from the same trajectory separated by a fixed time interval ($\delta = 2$ ns).
Goal: Refine representations to capture physically realistic transitions, conformational fluctuations, and metastable states.
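
The pair construction can be sketched as index sampling over a trajectory. The snippet assumes frames are saved at a uniform spacing `frame_dt_ns`, which is a hypothetical parameter; the actual frame spacing of the ATLAS trajectories is not stated here. Only the fixed interval of 2 ns comes from the description above.

```python
import numpy as np

def sample_md_pairs(n_frames, frame_dt_ns, delta_ns=2.0, n_pairs=4, rng=None):
    """Return (n_pairs, 2) frame indices (i, j) with j - i equal to delta_ns in frames.

    n_frames:    number of frames in one trajectory.
    frame_dt_ns: time between saved frames in ns (assumed uniform, hypothetical value).
    """
    if rng is None:
        rng = np.random.default_rng()
    stride = int(round(delta_ns / frame_dt_ns))
    if stride < 1 or stride >= n_frames:
        raise ValueError("delta_ns must span at least one frame and fit in the trajectory")
    # Choose start frames so that start + stride is still a valid frame index.
    starts = rng.integers(0, n_frames - stride, size=n_pairs)
    return np.stack([starts, starts + stride], axis=1)

# Example with a hypothetical 100 ns trajectory saved every 0.1 ns.
pairs = sample_md_pairs(n_frames=1001, frame_dt_ns=0.1, rng=np.random.default_rng(0))
```

Each row gives the indices of $g_0$ and $g_1$; both frames come from the same trajectory, so the pair reflects a physically simulated change rather than synthetic noise.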

Objective Function

The framework uses Conditional Flow Matching (CFM) to learn a velocity field that drives the system between two conformations, in both directions ($g_0 \to g_1$ and $g_1 \to g_0$).

Translation: Interpolated via linear interpolation (LERP) in $\mathbb{R}^3$.
Rotation: Interpolated via spherical linear interpolation (SLERP) of unit quaternions, which traces the shortest rotation path on $SO(3)$.
Loss: Minimizes the discrepancy between the learned velocity field and the ideal target velocity field for both translation and rotation components, effectively maximizing mutual information between paired conformations.
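
The interpolation paths and their target velocities can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the rotational path is written with rotation vectors (the geodesic on $SO(3)$, which coincides with shortest-path quaternion SLERP), the rotational velocity is expressed in the body frame of $g_0$, and the function name `interpolate_frames` is hypothetical. The exact loss weighting used by the authors is not reproduced.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def interpolate_frames(t0, R0, t1, R1, s):
    """Point at time s in [0, 1] on the path g0 -> g1, plus constant target velocities.

    t0, t1: (N, 3) translations. R0, R1: scipy Rotation objects holding N rotations.
    Returns (t_s, R_s, v_trans, v_rot).
    """
    # Relative rotation taking R0 to R1, as an axis-angle vector (angle <= pi).
    rel = (R0.inv() * R1).as_rotvec()

    t_s = (1.0 - s) * t0 + s * t1                 # LERP in R^3
    R_s = R0 * Rotation.from_rotvec(s * rel)      # geodesic (SLERP-equivalent) on SO(3)

    v_trans = t1 - t0                             # d/ds of the LERP path
    v_rot = rel                                   # constant angular velocity along the geodesic
    return t_s, R_s, v_trans, v_rot
```

A model trained with this target would regress its predicted translational and rotational velocities at `(t_s, R_s)` onto `v_trans` and `v_rot`. The reverse direction ($g_1 \to g_0$) is obtained by swapping the two views in the call.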

3. Key Contributions

RigidSSL Framework: A novel two-stage pretraining paradigm that decouples geometric understanding from generative modeling, utilizing rigid body representations to reduce degrees of freedom and enforce physical constraints.
Rigidity-Aware Flow Matching: A specific flow matching objective that jointly optimizes translational and rotational dynamics on $SE(3)$, respecting the inductive bias that residues behave as rigid bodies.
Multi-Scale Data Integration: Successfully integrates massive static databases (AFDB) with dynamic MD trajectories (ATLAS) to learn both stable geometric motifs and dynamic conformational ensembles.
State-of-the-Art Performance: Demonstrates significant improvements across unconditional generation, motif scaffolding, and conformational ensemble modeling.

4. Experimental Results

The authors evaluated RigidSSL on two backbone generative models: FrameDiff and FoldFlow-2.

Unconditional Protein Generation

Designability: RigidSSL-Perturb improved designability (the fraction of generated backbones for which a ProteinMPNN-designed sequence is predicted to fold back into the generated structure) by 43% for FoldFlow-2 and 10% for FrameDiff compared to unpretrained baselines.
Long-Chain Generation: RigidSSL-Perturb enabled the generation of ultra-long proteins (700–800 residues) with the best stereochemical quality (lowest Clashscore and MolProbity score), a task where unpretrained models failed.
Diversity: RigidSSL-MD significantly enhanced structural diversity (MaxCluster diversity up by 9.4% for FrameDiff), generating a broader spectrum of secondary structures (coils, mixed $\alpha/\beta$) compared to baselines that predominantly generated $\alpha$-helices.

Motif Scaffolding (Zero-Shot)

RigidSSL-Perturb achieved an average success rate 5.8 percentage points higher (15.19% vs. 9.35%, a relative gain of about 62%) in zero-shot motif scaffolding tasks, demonstrating superior robustness on difficult targets requiring long scaffolds.

Conformational Ensemble Generation (GPCRs)

Applied to G protein-coupled receptors (GPCRs), RigidSSL variants outperformed baselines on biophysical measures of ensemble quality, with each variant leading on different metrics.
RigidSSL-Perturb best predicted flexibility (lowest RMSF error) and collective mode coherence.
RigidSSL-MD excelled in capturing higher-order biophysical statistics, achieving the best Jaccard similarity for weak contacts and cryptically exposed residues, indicating a more realistic modeling of transient structural associations.
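
Two of the quantities named above have simple textbook definitions, sketched here for orientation. The snippet assumes the ensemble has already been superposed onto a common reference and that contacts are given as sets of residue-index pairs. How the paper defines weak contacts or cryptic exposure, and how it aggregates the errors, is not reproduced.

```python
import numpy as np

def rmsf(ensemble):
    """Per-residue root-mean-square fluctuation.

    ensemble: (n_conformations, n_residues, 3) coordinates, assumed pre-aligned.
    """
    mean_structure = ensemble.mean(axis=0)
    sq_dev = ((ensemble - mean_structure) ** 2).sum(axis=-1)   # (n_conf, n_res)
    return np.sqrt(sq_dev.mean(axis=0))

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two collections of items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy comparison between a generated and a reference contact set.
generated = {(1, 5), (2, 9), (3, 7)}
reference = {(2, 9), (3, 7), (4, 8)}
similarity = jaccard(generated, reference)   # 2 shared pairs out of 4 distinct pairs
```

An RMSF error would then compare `rmsf(generated_ensemble)` with `rmsf(reference_ensemble)` residue by residue, for example with a mean absolute difference.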

5. Significance

Paradigm Shift: RigidSSL supports the "pretrain-then-finetune" paradigm for protein design: in the reported experiments, front-loading geometric understanding generalized better than training the generative models without pretraining.
Physical Realism: By explicitly modeling residues as rigid bodies and incorporating MD data, the method bridges the gap between static structure prediction and dynamic biological function.
Scalability: The approach scales effectively to large datasets and long protein chains, addressing a major bottleneck in de novo design where current models struggle with long-range dependencies and stereochemical validity.
Complementary Strategies: The paper highlights that RigidSSL-Perturb is optimal for generating stable, foldable designs (high designability), while RigidSSL-MD is superior for exploring conformational landscapes and dynamic ensembles (high diversity and biophysical fidelity). This offers a flexible toolkit for different downstream design goals.