Imagine you have a very talented chef (a Protein Language Model) who has read every cookbook in the world. This chef is amazing at inventing new recipes (designing new proteins) that could potentially cure diseases or build new materials.

However, there's a glitch. Sometimes, when you ask the chef to write a new recipe, they get stuck in a loop. Instead of writing a complex dish with varied ingredients, they just start repeating the same sentence over and over: "Add salt, add salt, add salt, add salt..." or "Chop the onion, chop the onion, chop the onion..."

In a normal text story, this is just annoying and hard to read. But in the world of proteins, this is a disaster. Proteins are like tiny, folded origami machines. If the recipe is just a long string of the same ingredient, the machine won't fold correctly. It will collapse into a useless, sticky blob that doesn't work.

This paper, titled "Controlling Repetition in Protein Language Models," tackles this exact problem. Here is the breakdown of their solution in simple terms:

The Problem: The "Broken Record" Effect

The authors noticed that these AI chefs often suffer from "pathological repetition." They fall into two bad habits:

The Loop: Repeating short phrases like "ABC-ABC-ABC."
The Monotone: Repeating one single letter forever, like "AAAAAA."

The paper argues that simply telling the AI to "be more random" (a common trick in text generation) doesn't work well here. If you force the AI to be random, it might stop repeating, but it might also start writing gibberish that can't fold into a shape at all. It's like telling the chef to "stop using salt," but then they accidentally stop using any seasoning, making the food taste terrible.

The Solution: UCCS (The "Smart Editor")

The team created a new method called UCCS (Utility-Controlled Contrastive Steering). Think of this not as a rule you give the chef, but as a guide rail you install on the kitchen counter.

Here is how they built this guide rail:

The Training Camp: They gathered two groups of recipes.
- Group A (The Good Ones): Natural proteins that are diverse and fold perfectly.
- Group B (The Bad Ones): AI-generated proteins that were stuck in loops.
The "Utility" Filter: This is the clever part. Usually, bad recipes (loops) are also bad at folding. The team carefully filtered their groups so that the "Bad" group still had some potential to fold, just like the "Good" group. They made sure the only major difference between the two groups was the repetition, not the ability to fold.
Finding the "Repetition Vector": They looked inside the AI's brain (its internal math) to find the specific direction where "repetition" lives. Because they matched the groups so carefully, they found a pure "repetition signal" that wasn't mixed up with "bad folding."
The Steering: When the AI tries to generate a new protein, this method gently pushes its internal thoughts away from that "repetition direction." It's like a subtle nudge that says, "Hey, you're about to say 'AAAA' again; let's try something else instead."

The Results: A Better Chef

The authors tested this on several different AI chefs (models like ESM-3 and ProtGPT2) using different recipe books (datasets).

Before: The AI would often produce sticky, repetitive blobs that couldn't fold.
After (with UCCS): The AI produced diverse, complex sequences that were still predicted to fold well.

Crucially, their method was better than the old tricks (like just turning up the "randomness" knob). The old tricks either failed to stop the repetition or made the proteins impossible to fold. UCCS managed to stop the repetition without ruining the protein's ability to work.

The Big Picture

The authors present this paper as the first to systematically say, "Hey, repetition is a major failure mode for protein AIs, and here is exactly how to measure it and fix it."

They didn't just say "stop repeating." They figured out how to separate the concept of "repetition" from "structural quality" inside the AI's brain, allowing them to fix one without breaking the other. It's a new way to guide these powerful tools so they can design real, working biological machines instead of just looping text.

Technical Summary: Controlling Repetition in Protein Language Models

1. Problem Definition

Protein Language Models (PLMs) have advanced structure prediction and de novo protein design but suffer from a specific failure mode: pathological repetition. In natural language, repetition mainly hurts readability; in proteins, it yields sequences with low structural diversity, unstable folds, and little or no biological function.

The paper identifies two canonical forms of this failure:

Motif-level repetition: Short $n$-gram fragments (e.g., dipeptides or tripeptides) recur cyclically (e.g., AGAGAG).
Homopolymer repetition: Single amino acid residues extend into long runs (e.g., AAAAAA).
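
As a rough illustration of the two failure modes, the sketch below flags each one with a regular expression. The function name `repetition_types` and both thresholds are assumptions made for this example; they are not taken from the paper.

```python
import re

# Assumed thresholds, chosen only for illustration (not from the paper).
HOMOPOLYMER_MIN_RUN = 5   # a run of 5 or more identical residues
MOTIF_MIN_COPIES = 3      # a 2- or 3-residue unit repeated 3 or more times in a row

# One residue followed by at least (HOMOPOLYMER_MIN_RUN - 1) copies of itself.
_HOMOPOLYMER = re.compile(r"(.)\1{%d,}" % (HOMOPOLYMER_MIN_RUN - 1))
# Lookahead so that overlapping candidate start positions are all inspected.
_MOTIF = re.compile(r"(?=(.{2,3})\1{%d,})" % (MOTIF_MIN_COPIES - 1))


def repetition_types(seq: str) -> set:
    """Return the repetition failure modes found in an amino-acid string."""
    found = set()
    if _HOMOPOLYMER.search(seq):
        found.add("homopolymer")
    for match in _MOTIF.finditer(seq):
        unit = match.group(1)
        # A unit made of one residue type (e.g. "AA") is a homopolymer, not a motif.
        if len(set(unit)) > 1:
            found.add("motif")
            break
    return found
```

For example, `repetition_types("AGAGAGAG")` reports only the motif case, while a sequence containing a long run of alanines reports only the homopolymer case.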

Existing solutions from Natural Language Processing (NLP), such as decoding heuristics (temperature scaling, top-$p$ sampling, repetition penalties), often fail in the protein domain. These methods tend to reduce repetition at the cost of structural utility, frequently lowering AlphaFold confidence scores (pLDDT, pTM) and rendering generated proteins biologically implausible. Furthermore, repetition and structural utility are tightly entangled in PLMs, which makes them hard to separate without retraining.
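
For readers unfamiliar with these decoding heuristics, here is a minimal NumPy sketch of the three baselines named above. The function names are hypothetical, and the repetition penalty shown is one common formulation, not necessarily the exact variant the paper evaluated.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 flattens the distribution."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]           # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # number of tokens to keep
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()                  # renormalize the surviving tokens


def penalize_repeats(logits, previous_tokens, penalty=1.2):
    """Lower the logits of token ids that were already generated."""
    out = np.asarray(logits, dtype=float).copy()
    for tok in set(previous_tokens):
        # Divide positive logits and multiply negative ones, so both move down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

All three act only on the output distribution at each step. None of them looks at the hidden states, which is where the method described in the next section intervenes.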

2. Methodology: Utility-Controlled Contrastive Steering (UCCS)

The authors propose UCCS, a representation-level intervention that steers generation without retraining or modifying decoding heuristics. The method operates on the premise that repetition can be controlled by injecting specific steering vectors into hidden states, provided these vectors are derived from data where repetition is disentangled from structural utility.

2.1 Quantifying Repetition and Utility

To formalize the problem, the paper introduces two complementary metrics:

Repetition Score ($R(x)$): A unified metric aggregating:
- Token-level entropy ($H_{norm}$): Measures global amino acid imbalance.
- Distinct-$n$ ($n=2,3$): Measures local motif diversity.
- Homopolymer diversity ($R_{hpoly}$): Penalizes runs of identical residues longer than a threshold (default $k=4$).
Utility Score ($U(x)$): Derived from structural proxies, specifically the average of AlphaFold's pLDDT (local confidence) and pTM (global topology confidence).
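
The components listed above can be sketched in a few lines of Python. Several details here are assumptions, because this summary does not specify them: the components are combined by an unweighted mean, a higher score means a more diverse (less repetitive) sequence, the homopolymer term is the fraction of residues outside long runs, and pLDDT and pTM are both already on a 0 to 1 scale. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import groupby


def normalized_entropy(seq, alphabet_size=20):
    """Shannon entropy of residue frequencies, divided by log(alphabet_size)."""
    counts = Counter(seq)
    total = len(seq)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(alphabet_size)


def distinct_n(seq, n):
    """Fraction of n-grams in the sequence that are unique."""
    ngrams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def homopolymer_diversity(seq, k=4):
    """Fraction of residues NOT inside a run of identical residues longer than k."""
    run_lengths = [len(list(group)) for _, group in groupby(seq)]
    in_long_runs = sum(n for n in run_lengths if n > k)
    return 1.0 - in_long_runs / len(seq)


def repetition_score(seq, k=4):
    """Unweighted mean of the four components (assumed aggregation)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    parts = [
        normalized_entropy(seq),
        distinct_n(seq, 2),
        distinct_n(seq, 3),
        homopolymer_diversity(seq, k),
    ]
    return sum(parts) / len(parts)


def utility_score(plddt, ptm):
    """Average of pLDDT and pTM, both assumed to be scaled to [0, 1]."""
    return 0.5 * (plddt + ptm)
```

Under these assumptions a poly-alanine string scores close to zero, while a sequence with varied residues scores much higher.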

2.2 Contrastive Dataset Construction

A core challenge is that high-repetition sequences naturally correlate with low structural utility. To address this, UCCS constructs contrastive datasets ($D^+$ and $D^-$) that maximize the difference in repetition ($\Delta R$) while strictly constraining the difference in utility ($\Delta U \le \epsilon$).

Positive Set ($D^+$): Low-repetition sequences (natural proteins or high-quality PLM generations).
Negative Set ($D^-$): High-repetition sequences (pathological PLM generations).
Selection Strategy: The authors use Pareto-based frontier selection or composite scoring to ensure that selected sequences have comparable structural plausibility despite their differing repetition levels.
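
To make the utility constraint concrete, here is a simplified greedy matching sketch. It is a stand-in for illustration only, not the Pareto-based or composite-scoring procedure the authors use. The `Scored` record, the function name, and the defaults for `epsilon` and `max_pairs` are all hypothetical, and the score `r` is assumed to be higher for less repetitive sequences.

```python
from typing import List, NamedTuple, Tuple


class Scored(NamedTuple):
    seq: str
    r: float  # repetition score; assumed higher = less repetitive
    u: float  # utility score, e.g. mean of pLDDT and pTM on a 0-1 scale


def build_contrastive_sets(
    positives: List[Scored],
    negatives: List[Scored],
    epsilon: float = 0.05,
    max_pairs: int = 100,
) -> Tuple[List[Scored], List[Scored]]:
    """Greedily pair sequences so each pair differs in r but not (much) in u."""
    pos_pool = sorted(positives, key=lambda s: s.r, reverse=True)  # most diverse first
    neg_pool = sorted(negatives, key=lambda s: s.r)                # most repetitive first
    used = set()
    d_plus, d_minus = [], []
    for neg in neg_pool:
        if len(d_minus) >= max_pairs:
            break
        for i, pos in enumerate(pos_pool):
            if i in used:
                continue
            # Enforce the utility constraint |delta U| <= epsilon for this pair.
            if abs(pos.u - neg.u) <= epsilon and pos.r > neg.r:
                used.add(i)
                d_plus.append(pos)
                d_minus.append(neg)
                break
    return d_plus, d_minus
```

Negatives with no utility-matched positive are simply dropped. That is the point of the constraint: the surviving sets differ mainly in repetition.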

2.3 Steering Vector Injection

Once the contrastive sets are established, a steering vector ($v_L$) is computed at a specific transformer layer $L$:
$v_L = \mathbb{E}_{x \in D^+}[\phi_L(x)] - \mathbb{E}_{x \in D^-}[\phi_L(x)]$
where $\phi_L(x)$ is the sequence-level summary representation (mean-pooled for Masked Language Models, last-token for Autoregressive models).
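
The difference-of-means formula above can be sketched with NumPy as follows. The sketch assumes the layer-$L$ hidden states for each sequence have already been extracted as arrays of shape (sequence length, hidden size); extracting them from a real model is not shown. The names `summarize` and `steering_vector` are hypothetical.

```python
import numpy as np


def summarize(hidden, mode):
    """Collapse a (seq_len, d_model) hidden-state matrix into one d_model vector."""
    hidden = np.asarray(hidden, dtype=float)
    if mode == "mean":   # masked LMs: mean-pool over positions
        return hidden.mean(axis=0)
    if mode == "last":   # autoregressive LMs: take the last-token state
        return hidden[-1]
    raise ValueError("mode must be 'mean' or 'last'")


def steering_vector(pos_hidden, neg_hidden, mode="mean"):
    """v_L = mean summary over D+ minus mean summary over D-."""
    pos = np.stack([summarize(h, mode) for h in pos_hidden])
    neg = np.stack([summarize(h, mode) for h in neg_hidden])
    return pos.mean(axis=0) - neg.mean(axis=0)
```

Because each sequence is summarized before averaging, sequences of different lengths contribute equally to the vector.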

During inference, this vector is injected into the hidden states:
$\tilde{h}_t^L(x) = h_t^L(x) + \alpha \cdot v_L$
where $\alpha$ is a scalar controlling the steering strength. This operation suppresses pathological repetition patterns while preserving the structural features encoded in the utility-matched dataset.
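
The injection itself amounts to a broadcast addition, as this minimal NumPy sketch shows. The function name `steer` is hypothetical. In a PyTorch model, the same shift would typically be applied inside a forward hook on layer $L$; that wiring depends on the model and is omitted here.

```python
import numpy as np


def steer(hidden, v, alpha=1.0):
    """Add alpha * v to every position of a (seq_len, d_model) hidden-state matrix."""
    hidden = np.asarray(hidden, dtype=float)
    v = np.asarray(v, dtype=float)
    if hidden.shape[-1] != v.shape[-1]:
        raise ValueError("steering vector must match the hidden size")
    # Broadcasting applies the same shift alpha * v at each position t.
    return hidden + alpha * v
```

Setting `alpha=0` recovers the unmodified hidden states, so the steering strength can be tuned smoothly from no intervention upward.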

3. Key Contributions

Systematic Characterization: The first systematic study identifying pathological repetition as a critical failure mode in PLMs, categorizing it into motif-level and homopolymer repetition.
Metric Development: Introduction of a unified repetition score ($R(x)$) and a utility score ($U(x)$) to evaluate the trade-off between diversity and foldability.
UCCS Method: A dataset-guided steering method that disentangles repetition control from biological utility, avoiding the need for model retraining.
Comprehensive Evaluation: Extensive experiments across two model paradigms (Masked LM: ESM-3; Autoregressive LM: ProtGPT2), three datasets (CATH, UniRef50, SCOP), and two generation settings (unconditional and conditional).

4. Experimental Results

The paper evaluates UCCS against baselines including original models, decoding heuristics (temperature, top-$p$, repetition penalties), and mechanism-level interventions (neuron deactivation, probe steering).

Performance: UCCS consistently achieves the best trade-off, significantly improving the repetition score $R(x)$ while maintaining or slightly improving the utility score $U(x)$.
- On ESM-3, UCCS improved $R(x)$ by up to 55% in unconditional generation on CATH while preserving utility.
- On ProtGPT2, UCCS reduced repetition by up to 20% relative to temperature sampling without violating utility constraints.
Comparison to Baselines:
- Decoding Heuristics: While effective at increasing surface-level diversity, they consistently degraded structural confidence (lower pLDDT/pTM).
- Mechanism-Level Baselines: Neuron deactivation and probe steering showed unstable performance, often failing to improve both metrics simultaneously or degrading utility.
Ablation Studies:
- Dataset Construction: Structured selection (Pareto/Composite) outperformed random sampling, particularly in reducing variance.
- Steering Strength ($\alpha$): Optimal performance was found for $\alpha$ between roughly 1.0 and 2.0.
- Layer Injection: Effective steering occurred in the later layers of the network (last few layers for ESM-3; last third for ProtGPT2).
Generalization: The method was further validated on ProGen2 (autoregressive) and DPLM (diffusion-based), demonstrating robustness across different generation architectures.

5. Significance and Claims

The paper claims to establish repetition control as a central challenge for PLMs, distinct from the text-based repetition problem due to the direct impact on structural viability.

Principled Approach: The authors argue that dataset-guided contrastive steering offers a principled way to address the entanglement of repetition and utility, which decoding heuristics cannot resolve.
Mechanistic Insight: Through layer-wise probing and neuron-level correlation analysis, the paper reveals that repetition is a distributed representational phenomenon that emerges early, becomes entangled in mid-layers, and sharpens into a logit-aligned attractor state in final layers. ESM-3 is shown to have a more polarized repetition-related subspace than ProtGPT2, explaining its more severe degeneration.
Reliability: The work highlights that reliable protein generation requires interventions that preserve foldability, positioning UCCS as a plug-and-play mechanism that achieves this without the computational cost of retraining.

The authors conclude that while pathological repetition is a newly recognized issue in protein modeling, controlling it via representation steering is a necessary step toward generating biologically viable de novo proteins.
