Imagine you are trying to teach a computer to understand and create proteins. Proteins are the tiny, complex machines inside our bodies that do almost everything, from digesting food to fighting infections. To a computer, a protein looks like a long, squiggly string of beads (amino acids) that folds into a specific 3D shape.

The problem is that language-style AI models work best with "discrete" symbols (like letters or words), but protein shapes are "continuous" (smooth, real-valued coordinates in 3D space). It's like trying to describe a smooth, flowing river using only a dictionary of jagged rocks.

This paper introduces Yeti, a new tool designed to solve this translation problem. Here is how it works, explained simply:

1. The Problem: The "River vs. Rocks" Dilemma

Existing tools try to turn the smooth river of a protein's shape into a list of rocks (discrete tokens) so a computer can read it. However, most of these tools are like bad translators: they are great at looking at a finished river and saying, "Here are the rocks that make this river," but they are terrible at imagining a new river from scratch. They prioritize accuracy in copying over the ability to create.

2. The Solution: Yeti (The Master Translator)

The authors built Yeti (which stands for Yielding Encoded Tokens for Intermodality). Think of Yeti as a highly efficient translator that turns the smooth 3D shape of a protein into a compact list of "words" (tokens) that a computer can easily understand and manipulate.

The Dictionary: Yeti uses a dictionary of 8,192 unique "shape words."
The Method: Instead of just memorizing shapes, Yeti uses a technique called Flow Matching. Imagine a cloud of dust (random noise) slowly swirling and condensing into a perfect snowflake (the protein). Yeti learns the exact path the dust takes to become the snowflake. This allows it to not just copy shapes, but to generate new, plausible ones.

3. Why Yeti is Special

The paper compares Yeti to other "translators" (like ESM3, DPLM-2, and Kanzi) and finds three major advantages:

It's a Compact Genius: Yeti is surprisingly small. It achieves results similar to models that are 10 times larger. It's like a pocket-sized calculator that solves math problems as well as a room-sized supercomputer.
It Uses Its Whole Dictionary: Many translators get lazy and only use a few words from their dictionary over and over. Yeti uses almost every single word in its 8,192-word dictionary. This means it has a much richer vocabulary to describe complex shapes.
It Can Create from Scratch: The authors trained a new AI model using only Yeti's words and amino acid sequences, starting with zero prior knowledge. This new model could simultaneously invent a protein's sequence (the string of beads) and its 3D shape. It did this without needing a massive pre-trained brain, proving that Yeti's "words" are high-quality enough to build a new protein from nothing.

4. How It Works (The Folding Dance)

The authors also tracked how Yeti builds up a protein's shape during generation. The process is not uniform; it unfolds in two phases:

The Huddle: First, the protein quickly bunches up into a compact ball (like a crowd of people gathering).
The Refinement: Only at the very end does it snap into its final, detailed shape (like the crowd suddenly forming a specific human pyramid).

The Bottom Line

Yeti is a new, efficient, and expressive way to turn protein shapes into computer language. It suggests that you don't need a massive, bloated model to understand proteins; a smart, compact tokenizer that speaks the language of 3D shapes fluently can go a long way. This opens the door for computers to design new proteins that are both structurally sound and functionally useful, while using less computing power than current methods.

Technical Summary: Yeti – A Compact Protein Structure Tokenizer for Reconstruction and Multi-Modal Generation

1. Problem Statement

The integration of protein sequences, 3D structures, and functional annotations into a unified multi-modal representation is critical for advancing computational biology. However, current approaches face three primary challenges:

Data Heterogeneity: Biological data spans diverse modalities (biophysical constraints, text, imaging, volumetric data) with distinct statistical properties.
Incompleteness and Noise: Real-world datasets often suffer from missing modalities, noisy measurements, and uneven coverage across the structure-function space.
Scalability and Coupling: Most models operate at a single scale or modality, lacking mechanisms to transfer information across levels (atomic to cellular) or to handle coupled processes.

Specifically, while recent generative models treat proteins as 3D objects in continuous coordinate spaces (e.g., SE(3) diffusion), there is a compelling alternative: encoding 3D atomic structures as discrete tokens. This allows for scalable, modality-agnostic training similar to that of language models. However, existing protein structure tokenizers (e.g., ESM3, DPLM-2, Kanzi) often prioritize reconstruction accuracy over generative capability. The result is a mismatch: high reconstruction fidelity does not guarantee diverse, well-utilized tokens, nor does it guarantee support for unconditional co-generation of sequence and structure. Furthermore, many state-of-the-art models rely on massive parameter counts or on two-stage training paradigms in which the encoder is frozen, potentially limiting end-to-end optimization.

2. Methodology

2.1 Architecture: Yeti

Yeti (Yielding Encoded Tokens for Intermodality) is a compact, single-stage protein structure tokenizer designed to map continuous 3D atomic coordinates to discrete tokens.

Input: Protein structures represented as mean-centered $C_\alpha$ coordinates ( $x \in \mathbb{R}^{L \times 3}$ ).
Encoder: A Transformer-based encoder utilizing multi-head attention with rotary positional encoding (RoPE) maps inputs to continuous latent embeddings $Z \in \mathbb{R}^{L \times D}$ .
Quantization: The embeddings are projected into a quantization space and processed by a Lookup-Free Quantizer (LFQ). The LFQ space is a Cartesian product of binary sub-codebooks ( $C_i = \{-1, +1\}$ ). Quantization is a sign operation applied to each dimension, which maps a vector directly to a discrete index without a lookup table. With $d$ binary dimensions in the quantization space, the vocabulary size is $K = 2^d$ (so the 8,192-entry codebook reported in the results corresponds to $d = 13$ ).
Decoder: A StripedHyena architecture (a hybrid model that combines gated convolutions with attention) acts as the decoder. Unlike traditional decoders that rely on SE(3)-invariant components like Invariant Point Attention (IPA), the StripedHyena decoder learns a conditional velocity field to reconstruct structures from noisy coordinates.
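
The sign-based quantization step can be illustrated with a minimal sketch. It assumes 13 binary dimensions (2^13 = 8,192, matching the codebook size quoted in the results). The function name `lfq_quantize`, the bit ordering, and the handling of exact zeros are illustrative choices, not details taken from the paper.

```python
def lfq_quantize(z):
    """Quantize one projected embedding with lookup-free quantization.

    Each dimension is snapped to -1 or +1 by its sign, and the resulting
    bit pattern is read as an integer token index. Mapping 0.0 to +1 and
    treating dimension 0 as the least significant bit are assumptions.
    """
    codes = [1 if v >= 0 else -1 for v in z]
    index = sum(1 << i for i, c in enumerate(codes) if c > 0)
    return codes, index


# Hypothetical 13-dimensional projection of a single residue embedding.
z = [0.7, -1.2, 0.1, -0.3, 2.0, -0.5, 0.9, 0.4, -0.8, 1.1, -0.2, 0.6, -1.5]
codes, token = lfq_quantize(z)
vocab_size = 2 ** len(z)  # number of distinct sign patterns
```

Because the index is read directly from the signs, no nearest-neighbor search over a learned codebook is needed, which is what "lookup-free" refers to.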

2.2 Training Objective: Flow Matching

Yeti is trained end-to-end using a Flow Matching objective, which interpolates between a Gaussian noise distribution ( $p_0$ ) and the target protein structure distribution ( $p_{data}$ ).

Loss Function: The model minimizes the difference between the predicted conditional velocity field $v_\theta(x_t, t, c)$ and the target vector field ( $x_1 - x_0$ ), where $x_1$ is the clean structure and $x_0$ is Gaussian noise.
Entropy Regularization: To prevent codebook collapse and ensure high utilization, an entropy penalty is added to the loss function: $L = L_{FM} + \lambda L_{entropy}$ .
Inference: Classifier-Free Guidance (CFG) is employed during inference. By randomly zeroing out token embeddings during training, the model learns both conditional and unconditional velocity fields, allowing for guided decoding to refine structure accuracy.
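
As a rough illustration of the objective and of guided decoding, the sketch below builds one training pair and combines two velocity fields. It assumes the straight-line path $x_t = (1 - t) x_0 + t x_1$, which is the path whose time derivative equals the target $x_1 - x_0$ named above. The guidance formula and all function names are assumptions made for illustration; the entropy term and the network itself are left out.

```python
import random


def fm_training_pair(x1, t, rng):
    """Build one flow-matching example from a clean structure x1.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1, so the target
    velocity is x1 - x0 at every time t.
    """
    x0 = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in x1]  # Gaussian noise
    x_t = [[(1 - t) * a + t * b for a, b in zip(r0, r1)]
           for r0, r1 in zip(x0, x1)]
    target = [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]
    return x0, x_t, target


def fm_loss(v_pred, target):
    """Mean squared error between predicted and target velocities."""
    sq = [(p - q) ** 2 for rp, rq in zip(v_pred, target) for p, q in zip(rp, rq)]
    return sum(sq) / len(sq)


def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance in its common extrapolation form (assumed).

    w = 1 recovers the conditional field; w > 1 pushes further away from
    the unconditional prediction.
    """
    return [[u + w * (c - u) for c, u in zip(rc, ru)]
            for rc, ru in zip(v_cond, v_uncond)]


rng = random.Random(0)
x1 = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]]  # toy C-alpha trace
x0, x_t, target = fm_training_pair(x1, t=0.25, rng=rng)
perfect_loss = fm_loss(target, target)  # a perfect predictor has zero loss
```

In training, `v_pred` would come from the decoder evaluated at `x_t`, `t`, and the token embeddings; zeroing those embeddings for a random subset of examples is what makes the unconditional field available at inference time.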

2.3 Multi-Modal Co-Generation

To validate the generative utility of Yeti tokens, the authors trained a Masked Diffusion Model (MDM) from scratch. This model jointly generates amino acid sequences and Yeti structure tokens without any pretrained initialization (unlike DPLM-2 which warm-starts from a sequence model, or ESM3 which uses cascaded pipelines).
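
Masked diffusion models typically generate by starting from fully masked tracks and revealing tokens over several steps. The toy loop below shows that idea for a joint sequence and structure track. Everything in it is hypothetical: `toy_denoiser` is a random stand-in for the trained network, and the unmasking schedule (an equal share of randomly ordered slots per step) is an assumption, not the authors' sampler.

```python
import random

MASK = -1
AA_VOCAB = 20        # standard amino acids
STRUCT_VOCAB = 8192  # Yeti structure tokens


def toy_denoiser(seq, struct, rng):
    """Stand-in for the trained network: proposes a token for every position.

    A real model would condition on the partially unmasked tracks; this
    placeholder samples uniformly so the loop can run end to end.
    """
    n = len(seq)
    return ([rng.randrange(AA_VOCAB) for _ in range(n)],
            [rng.randrange(STRUCT_VOCAB) for _ in range(n)])


def co_generate(length, steps, rng):
    """Jointly unmask a sequence track and a structure track."""
    seq = [MASK] * length
    struct = [MASK] * length
    # Each (track, position) slot is revealed exactly once, in random order.
    slots = [(track, i) for track in (0, 1) for i in range(length)]
    rng.shuffle(slots)
    per_step = -(-len(slots) // steps)  # ceiling division
    for start in range(0, len(slots), per_step):
        seq_prop, struct_prop = toy_denoiser(seq, struct, rng)
        for track, i in slots[start:start + per_step]:
            if track == 0:
                seq[i] = seq_prop[i]
            else:
                struct[i] = struct_prop[i]
    return seq, struct


seq_tokens, struct_tokens = co_generate(length=16, steps=8, rng=random.Random(0))
```

The structure tokens produced this way would then be passed to the Yeti decoder to obtain 3D coordinates.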

3. Key Contributions

Compact, End-to-End Tokenizer: Introduction of Yeti, a single-stage tokenizer trained end-to-end with flow matching, avoiding the two-stage "freeze encoder" paradigm common in prior work.
High Codebook Utilization: Yeti achieves superior codebook utilization and token diversity compared to larger models (ESM3, DPLM-2) and other quantization methods (FSQ, VQ-VAE), despite having significantly fewer parameters.
Parameter Efficiency: Yeti achieves competitive reconstruction accuracy with only 62.5M parameters, roughly one-tenth the size of ESM3 (648M) and about half the size of DPLM-2 (118M).
Proof-of-Concept Co-Generation: Demonstration that a compact multi-modal model trained from scratch on Yeti tokens can perform unconditional co-generation of protein sequences and structures, achieving results comparable to those of substantially larger models.
Analysis of Folding Dynamics: Identification of a hierarchical emergence of secondary structure during the flow-matching decoding trajectory, characterized by early global compaction followed by a "late commitment" phenomenon where global topology consolidates in the final steps.

4. Results

4.1 Codebook Analysis

Utilization: On an 8,192-entry codebook, Yeti achieves high perplexity and entropy, significantly outperforming ESM3 and Kanzi. On the CATH dataset, Yeti utilizes ~66% of its capacity (Perplexity = 5414), with an intra-structure diversity ( $\bar{U}$ ) of 98.8%, nearly saturating the metric and indicating highly distinct token assignments for residues.
Comparison: Yeti outperforms DPLM-2 in codebook utilization despite both using LFQ, attributed to the end-to-end training paradigm versus DPLM-2's frozen encoder approach.
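
The utilization figures can be related to each other with a short calculation. The sketch below computes perplexity as the exponential of the token-usage entropy and divides it by the codebook size, which is consistent with the numbers quoted above (5414 / 8192 is about 66%). The definition used for intra-structure diversity (distinct tokens divided by structure length) is an assumption about how $\bar{U}$ is computed per structure.

```python
import math
from collections import Counter


def codebook_stats(tokens, vocab_size):
    """Entropy-based usage statistics for a stream of token indices."""
    counts = Counter(tokens)
    total = len(tokens)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    perplexity = math.exp(entropy)  # effective number of codes in use
    return perplexity, perplexity / vocab_size


def intra_structure_diversity(tokens):
    """Fraction of distinct tokens within one structure (assumed definition)."""
    return len(set(tokens)) / len(tokens)


# Uniform use of 4 out of 8 codes: perplexity 4, utilization 50%.
ppl, util = codebook_stats([0, 1, 2, 3] * 5, vocab_size=8)

# The CATH figures quoted above.
reported_util = 5414 / 8192
```

Perplexity equals the codebook size only when every code is used equally often, so it penalizes both unused codes and heavily skewed usage.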

4.2 Reconstruction Accuracy

Metrics: Yeti achieves a TM-score of 0.96 on CAMEO, 0.97 on CASP14, and 0.95 on both CASP15 and CATH.
Efficiency: While ESM3 shows slightly lower RMSD (due to its massive scale), Yeti's TM-scores are comparable or superior, indicating it recovers correct global topology effectively. Yeti's flow-based decoder produces samples with variance in local atomic positions (penalizing RMSD) but consistent global topology, a feature beneficial for multi-conformation generation.
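
The different sensitivity of the two metrics can be seen in a small calculation. The sketch below uses the standard TM-score length normalization, $d_0 = 1.24 (L - 15)^{1/3} - 1.8$, but evaluates the score for a single fixed superposition with a one-to-one residue mapping; the full metric maximizes over superpositions, and that search is omitted here. The lower bound of 0.5 on $d_0$ is a guard added for short chains, and the distance values are made up.

```python
import math


def rmsd(dists):
    """Root-mean-square deviation from per-residue distances (angstroms)."""
    return math.sqrt(sum(d * d for d in dists) / len(dists))


def tm_score(dists):
    """TM-score for a fixed superposition and one-to-one residue mapping.

    The optimization over superpositions used by the full metric is
    omitted, so this is a simplified illustration.
    """
    n = len(dists)
    if n > 15:
        d0 = max(1.24 * (n - 15) ** (1.0 / 3.0) - 1.8, 0.5)
    else:
        d0 = 0.5  # guard for very short chains (assumption)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in dists) / n


# 100 residues: uniform 1 A jitter versus a perfect core with 10 residues off by 10 A.
jitter = [1.0] * 100
outliers = [0.0] * 90 + [10.0] * 10

rmsd_jitter, tm_jitter = rmsd(jitter), tm_score(jitter)
rmsd_outliers, tm_outliers = rmsd(outliers), tm_score(outliers)
```

In this toy case the second set of distances has more than three times the RMSD of the first, yet both TM-scores stay above 0.9, because each residue's contribution to the TM-score is bounded while squared deviations are not.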

4.3 Unconditional Co-Generation

Performance: A 224M-parameter model trained from scratch on 2.6M proteins (using Yeti tokens) achieved a self-consistency TM-score (scTM) of 0.70 and pLDDT of 76.12.
Comparison: This scTM falls below that of DPLM-2 (0.87) but is well above that of ESM3-Open (0.46), even though DPLM-2 relies on a 650M-parameter pretrained sequence model and ESM3 uses a cascaded pipeline.
Designability: ProteinMPNN recovered 30.6% of the original generated sequences from the generated structures, confirming the geometric consistency and designability of the co-generated pairs.
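
Sequence recovery is commonly computed as the fraction of positions at which the redesigned sequence matches the reference. The sketch below assumes that per-residue definition applies here; the two sequences are made up for illustration.

```python
def sequence_recovery(generated, redesigned):
    """Fraction of positions where two equal-length sequences agree."""
    if len(generated) != len(redesigned):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(generated, redesigned))
    return matches / len(generated)


# Toy example with made-up sequences: 3 of 10 positions match.
recovery = sequence_recovery("MKTAYIAKQR", "MKSLYVGEHP")
```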

4.4 Flow Trajectory Dynamics

Analysis of the decoding trajectory revealed a staged, non-uniform emergence of features:

Early Phase ( $t < 0.6$ ): Rapid global compaction (decreasing Radius of Gyration).
Late Phase ( $t > 0.9$ ): Rapid coalescence of secondary structure elements ( $\alpha$ -helices and $\beta$ -sheets) and consolidation of global topology.
This suggests the tokenizer effectively decouples coarse-grained density from high-frequency structural details.
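
Global compaction along a trajectory can be tracked with the radius of gyration. The sketch below computes it for each intermediate frame, treating every residue as a unit mass; the toy trajectory is synthetic and only illustrates the calculation.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid (unit masses)."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    sq = sum(sum((p[k] - centroid[k]) ** 2 for k in range(3)) for p in coords)
    return math.sqrt(sq / n)


def rg_along_trajectory(frames):
    """Radius of gyration for each intermediate structure in a trajectory."""
    return [radius_of_gyration(f) for f in frames]


# Toy trajectory: the same four points shrinking toward their centroid.
base = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, -2.0, 0.0]]
frames = [[[s * c for c in p] for p in base] for s in (1.0, 0.5, 0.25)]
rg_values = rg_along_trajectory(frames)
```

Applied to the intermediate coordinates of a decoding run, a curve of this kind is what would show the early drop in radius of gyration described above.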

5. Significance and Claims

The paper claims that Yeti serves as a compact and expressive foundation for training multi-modal protein models. Its primary significance lies in demonstrating that:

Discrete Tokenization is Viable for Generation: High-quality, unconditional co-generation of sequence and structure is possible without massive pretraining or cascaded pipelines, provided the tokenizer is expressive and the training objective (flow matching) is aligned with generative goals.
Efficiency over Scale: A carefully designed, smaller model (62.5M parameters) can outperform or match much larger models (600M+) in token diversity and reconstruction, challenging the notion that scale is the only path to performance in protein modeling.
Unified Representation: The ability to train a model from scratch on joint sequence-structure data validates the potential for a universal molecular representation that captures the intrinsic coupling between amino acid sequences and 3D structures.

The authors position Yeti not as a final solution for all protein design tasks, but as a proof-of-concept establishing the viability of their tokenization strategy. They acknowledge limitations, such as the reliance on single-conformation structures and the need for further scaling to match the performance of specialized, heavily tuned models like La-Proteina. However, they assert that Yeti provides a scalable, efficient, and effective backbone for future multi-modal protein models capable of co-generating highly plausible sequences and structures.

Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation