Technical Summary: Yeti – A Compact Protein Structure Tokenizer for Reconstruction and Multi-Modal Generation
1. Problem Statement
The integration of protein sequences, 3D structures, and functional annotations into a unified multi-modal representation is critical for advancing computational biology. However, current approaches face three primary challenges:
- Data Heterogeneity: Biological data spans diverse modalities (biophysical constraints, text, imaging, volumetric data) with distinct statistical properties.
- Incompleteness and Noise: Real-world datasets often suffer from missing modalities, noisy measurements, and uneven coverage across the structure-function space.
- Scalability and Coupling: Most models operate at a single scale or modality, lacking mechanisms to transfer information across levels (atomic to cellular) or to handle coupled processes.
Specifically, recent generative models treat proteins as 3D objects in continuous coordinate spaces (e.g., SE(3) diffusion), but there is a compelling alternative: encoding 3D atomic structures as discrete tokens, which allows scalable, modality-agnostic training similar to language models. Existing protein structure tokenizers (e.g., ESM3, DPLM-2, Kanzi), however, often prioritize reconstruction accuracy over generative capability. The two goals are misaligned: high reconstruction fidelity guarantees neither diverse token usage nor support for unconditional co-generation of sequence and structure. Furthermore, many state-of-the-art models rely on massive parameter counts or on two-stage training paradigms in which the encoder is frozen, potentially limiting end-to-end optimization.
2. Methodology
2.1 Architecture: Yeti
Yeti (Yielding Encoded Tokens for Intermodality) is a compact, single-stage protein structure tokenizer designed to map continuous 3D atomic coordinates to discrete tokens.
- Input: Protein structures represented as mean-centered Cα coordinates ($x \in \mathbb{R}^{L \times 3}$).
- Encoder: A Transformer-based encoder utilizing multi-head attention with rotary positional encoding (RoPE) maps inputs to continuous latent embeddings $Z \in \mathbb{R}^{L \times D}$.
- Quantization: The embeddings are projected into a quantization space and processed by a Lookup-Free Quantizer (LFQ). The LFQ space is a Cartesian product of binary sub-codebooks ($C_i = \{-1, +1\}$). Quantization is performed via a sign operation, mapping vectors to discrete indices without the need for a lookup table, yielding a vocabulary size $K = 2^D$.
- Decoder: A StripedHyena architecture (a hybrid signal processing model) acts as the decoder. Unlike traditional decoders that rely on SE(3)-invariant components like Invariant Point Attention (IPA), the StripedHyena decoder learns a conditional velocity field to reconstruct structures from noisy coordinates.
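The sign-based quantization step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `lfq_quantize`, the treatment of exact zeros, and the bit ordering used to form integer ids are all assumptions. D = 13 is used only because 2^13 = 8192 matches the codebook size reported in Section 4.1.

```python
import numpy as np

def lfq_quantize(z):
    """Sign-quantize latents z of shape (L, D).

    Returns the binary codes in {-1, +1} and one integer token id per residue.
    Assumption: zeros map to +1, and dimension i contributes bit 2**i.
    """
    codes = np.where(z >= 0, 1.0, -1.0)        # (L, D), entries are -1 or +1
    bits = (codes > 0).astype(np.int64)         # -1 -> 0, +1 -> 1
    weights = 2 ** np.arange(z.shape[1])        # [1, 2, 4, ..., 2**(D-1)]
    token_ids = bits @ weights                  # (L,), values in [0, 2**D)
    return codes, token_ids

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 13))                    # 5 residues, D = 13 -> K = 8192
codes, token_ids = lfq_quantize(z)
print(token_ids)
```

No codebook lookup is needed: the token id is read directly off the signs, which is why the vocabulary grows as 2^D with the quantization dimension.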
2.2 Training Objective: Flow Matching
Yeti is trained end-to-end using a Flow Matching objective, which interpolates between a Gaussian noise distribution ($p_0$) and the target protein structure distribution ($p_{\text{data}}$).
- Loss Function: The model minimizes the difference between the predicted conditional velocity field $v_\theta(x_t, t, c)$ and the target vector field $(x_1 - x_0)$, where $x_1$ is the clean structure and $x_0$ is Gaussian noise.
- Entropy Regularization: To prevent codebook collapse and ensure high utilization, an entropy penalty is added to the loss function: $\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda \mathcal{L}_{\text{entropy}}$.
- Inference: Classifier-Free Guidance (CFG) is applied at decoding time. Because token embeddings are randomly zeroed out during training, the model learns both conditional and unconditional velocity fields, which can be combined to guide decoding toward more accurate structures.
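The training target and the guided velocity can be illustrated with a small NumPy sketch. Several details here are assumptions rather than facts from the paper: the linear interpolation path x_t = (1 - t) x_0 + t x_1 (chosen because its time derivative is the stated target x_1 - x_0), the mean-squared-error form of the loss, and the standard CFG combination with a guidance scale `w`. The function names are hypothetical.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Sample noise x0 and time t, return the noisy input and velocity target.

    Assumption: a straight-line path between noise and data.
    """
    x0 = rng.normal(size=x1.shape)      # Gaussian noise sample
    t = rng.uniform()                   # time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1        # point on the path at time t
    target = x1 - x0                    # target velocity along the path
    return xt, t, target

def fm_loss(v_pred, target):
    """Mean squared error between predicted and target velocity (assumed form)."""
    return float(np.mean((v_pred - target) ** 2))

def cfg_velocity(v_cond, v_uncond, w):
    """Standard classifier-free guidance: w = 0 is unconditional, w = 1 conditional."""
    return v_uncond + w * (v_cond - v_uncond)

rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 3))
x1 -= x1.mean(axis=0)                   # mean-centered coordinates, as in Section 2.1
xt, t, target = flow_matching_pair(x1, rng)
print(fm_loss(np.zeros_like(target), target))
```

In a real training loop, `v_pred` would come from the decoder evaluated at `(xt, t, c)`, with the token conditioning `c` zeroed out on a random subset of examples so that both velocity fields are learned.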
2.3 Multi-Modal Co-Generation
To validate the generative utility of Yeti tokens, the authors trained a Masked Diffusion Model (MDM) from scratch. This model jointly generates amino acid sequences and Yeti structure tokens without any pretrained initialization (unlike DPLM-2 which warm-starts from a sequence model, or ESM3 which uses cascaded pipelines).
3. Key Contributions
- Compact, End-to-End Tokenizer: Introduction of Yeti, a single-stage tokenizer trained end-to-end with flow matching, avoiding the two-stage "freeze encoder" paradigm common in prior work.
- High Codebook Utilization: Yeti achieves superior codebook utilization and token diversity compared to larger models (ESM3, DPLM-2) and other quantization methods (FSQ, VQ-VAE), despite having significantly fewer parameters.
- Parameter Efficiency: Yeti achieves competitive reconstruction accuracy with only 62.5M parameters, roughly 10x fewer than ESM3 (648M) and 2x fewer than DPLM-2 (118M).
- Proof-of-Concept Co-Generation: Demonstration that a compact multi-modal model trained from scratch on Yeti tokens can perform unconditional co-generation of protein sequences and structures, achieving results comparable to models 10x larger.
- Analysis of Folding Dynamics: Identification of a hierarchical emergence of secondary structure during the flow-matching decoding trajectory, characterized by early global compaction followed by a "late commitment" phenomenon where global topology consolidates in the final steps.
4. Results
4.1 Codebook Analysis
- Utilization: On an 8,192-entry codebook, Yeti achieves high perplexity and entropy, significantly outperforming ESM3 and Kanzi. On the CATH dataset, Yeti utilizes ~66% of its capacity (Perplexity = 5414), with an intra-structure diversity ($\bar{U}$) of 98.8%, nearly saturating the metric and indicating highly distinct token assignments for residues.
- Comparison: Yeti outperforms DPLM-2 in codebook utilization despite both using LFQ, attributed to the end-to-end training paradigm versus DPLM-2's frozen encoder approach.
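For readers unfamiliar with these metrics, the sketch below shows how codebook perplexity is conventionally computed from token counts and how the ~66% figure follows from the reported numbers. The function names are hypothetical, and `intra_structure_diversity` encodes only one plausible reading of the diversity metric (the fraction of distinct tokens within a single structure); the paper's exact definition is not given in this summary.

```python
import numpy as np

def codebook_perplexity(token_ids, vocab_size):
    """exp(entropy) of the empirical token distribution.

    Ranges from 1 (one token used) to vocab_size (perfectly uniform usage).
    """
    counts = np.bincount(token_ids, minlength=vocab_size)
    p = counts / counts.sum()
    nz = p[p > 0]                                # skip unused tokens, 0*log(0) := 0
    entropy = -np.sum(nz * np.log(nz))
    return float(np.exp(entropy))

def intra_structure_diversity(token_ids):
    """Assumed definition: unique tokens divided by sequence length."""
    return len(np.unique(token_ids)) / len(token_ids)

# Effective utilization implied by the reported CATH numbers.
utilization = 5414 / 8192
print(f"{utilization:.1%}")                      # about 66%

# Toy check: uniform usage of 8 tokens gives perplexity 8.
print(codebook_perplexity(np.arange(8), vocab_size=8))
```

Perplexity can be read as the effective number of tokens in use, so dividing it by the codebook size gives the utilization fraction quoted above.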
4.2 Reconstruction Accuracy
- Metrics: Yeti achieves a TM-score of 0.96 on CAMEO, 0.97 on CASP14, and 0.95 on CASP15 and CATH.
- Efficiency: While ESM3 shows slightly lower RMSD (due to its massive scale), Yeti's TM-scores are comparable or superior, indicating it recovers correct global topology effectively. Yeti's flow-based decoder produces samples with variance in local atomic positions (penalizing RMSD) but consistent global topology, a feature beneficial for multi-conformation generation.
4.3 Unconditional Co-Generation
- Performance: A 224M-parameter model trained from scratch on 2.6M proteins (using Yeti tokens) achieved a self-consistency TM-score (scTM) of 0.70 and pLDDT of 76.12.
- Comparison: At scTM 0.70, the model trails DPLM-2 (scTM 0.87) but is significantly better than ESM3-Open (scTM 0.46). The authors regard the result as comparable to DPLM-2, given that DPLM-2 relies on a 650M-parameter pretrained sequence model and ESM3 uses a cascaded pipeline.
- Designability: ProteinMPNN recovered 30.6% of the original generated sequences from the generated structures, confirming the geometric consistency and designability of the co-generated pairs.
4.4 Flow Trajectory Dynamics
Analysis of the decoding trajectory revealed a non-monotonic emergence of features:
- Early Phase (t<0.6): Rapid global compaction (decreasing Radius of Gyration).
- Late Phase (t>0.9): Rapid coalescence of secondary structure elements (α-helices and β-sheets) and consolidation of global topology.
This suggests the tokenizer effectively decouples coarse-grained density from high-frequency structural details.
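A trajectory analysis of this kind can be sketched by integrating the velocity field and recording the radius of gyration at each step. The sketch below is illustrative only: the forward Euler solver, the step count, and the toy velocity field (a straight-line flow toward a known target, standing in for the learned decoder) are assumptions, not details from the paper.

```python
import numpy as np

def radius_of_gyration(x):
    """Root-mean-square distance of the points in x (L, 3) from their centroid."""
    centered = x - x.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum(centered ** 2, axis=1))))

def euler_decode(velocity_fn, x0, n_steps=50):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1, tracking Rg along the way."""
    x = x0.copy()
    rg_trace = [radius_of_gyration(x)]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)           # forward Euler step
        rg_trace.append(radius_of_gyration(x))
    return x, rg_trace

rng = np.random.default_rng(0)
x1 = rng.normal(size=(20, 3))                    # toy compact "structure"
x0 = 10.0 * rng.normal(size=(20, 3))             # diffuse starting noise

def toy_velocity(x, t):
    # Stand-in for the learned decoder: straight-line flow toward x1.
    return (x1 - x) / (1.0 - t)

x_final, rg_trace = euler_decode(toy_velocity, x0)
print(rg_trace[0], rg_trace[-1])
```

With a learned decoder in place of `toy_velocity`, plotting `rg_trace` against t, together with a per-step secondary-structure assignment, would expose the early compaction and late consolidation phases described above.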
5. Significance and Claims
The paper claims that Yeti serves as a compact and expressive foundation for training multi-modal protein models. Its primary significance lies in demonstrating that:
- Discrete Tokenization is Viable for Generation: High-quality, unconditional co-generation of sequence and structure is possible without massive pretraining or cascaded pipelines, provided the tokenizer is expressive and the training objective (flow matching) is aligned with generative goals.
- Efficiency over Scale: A carefully designed, smaller model (62.5M parameters) can outperform or match much larger models (600M+) in token diversity and reconstruction, challenging the notion that scale is the only path to performance in protein modeling.
- Unified Representation: The ability to train a model from scratch on joint sequence-structure data validates the potential for a universal molecular representation that captures the intrinsic coupling between amino acid sequences and 3D structures.
The authors position Yeti not as a final solution for all protein design tasks, but as a proof-of-concept establishing the viability of their tokenization strategy. They acknowledge limitations, such as the reliance on single-conformation structures and the need for further scaling to match the performance of specialized, heavily tuned models like La-Proteina. However, they assert that Yeti provides a scalable, efficient, and effective backbone for future multi-modal protein models capable of co-generating highly plausible sequences and structures.