Ankh-score produces better sequence alignments than AlphaFold3

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to solve a massive jigsaw puzzle, but instead of pictures on the pieces, you have strings of letters (amino acids) that make up proteins. Your goal is to line up two different strings so that the matching letters sit right on top of each other. This is called protein sequence alignment, and it's the foundation of understanding how life works, how diseases develop, and how to design new medicines.

For decades, scientists used a standard rulebook (called BLOSUM matrices) to guess which letters matched. It was like using a basic dictionary to translate a foreign language: good enough, but often missing the nuance.

Recently, two "super-smart" technologies arrived to help solve this puzzle:

AlphaFold3 (The Architect): This AI is a master builder. It looks at the string of letters and predicts what the 3D shape of the protein will look like. The idea was: "If we build the 3D models and line them up physically, we can see which letters match."
Protein Language Models (The Linguists): These are AIs trained on millions of protein "sentences." They don't just look at the shape; they understand the context and meaning of the letters, much like how a human understands that the word "bank" means something different in a river context versus a money context. One specific model, called Ankh, is the star of this paper.

The Big Race: Who Wins?

The authors of this paper set up a massive tournament to see which method produces the best alignment:

The Old Way: Using the basic rulebook (BLOSUM).
The Architect Way: Using AlphaFold3 to build 3D models, then lining them up.
The Linguist Way: Using the Ankh model to score how well the letters match based on their "meaning."

The Result: The Ankh-score (The Linguist) won decisively, beating the Architect on 78.75% of the test domains and losing on only 10.63%.

Why Did the "Linguist" Beat the "Architect"?

This is the most surprising part of the story. You might think that knowing the exact 3D shape (the Architect's strength) would be the ultimate way to line things up. But the paper suggests that Ankh knows something the 3D models don't.

Think of it this way:

AlphaFold3 is like a photographer taking a high-resolution 3D photo of a person. It shows you exactly what they look like right now.
Ankh is like a biographer who has read the person's entire life story, their family history, and their diary.

Sometimes, two people might look very similar in a photo (similar 3D shape), but their life stories (evolutionary history) tell you they are actually quite different. Or, two people might look different in a photo but share a deep, hidden family connection.

The paper found that Ankh captures this "life story" information (evolutionary and functional patterns) better than the 3D shape alone. Even when the test was limited to proteins whose 3D models lined up well with each other, the "biographer" (Ankh) could still line up the letters more accurately because it understood the context of the letters, not just their physical position.

Real-Life Examples from the Paper

The authors showed three specific cases where the "Architect" (AlphaFold3) got it wrong, but the "Linguist" (Ankh) got it right:

The Twin Towers: Imagine two proteins that look like towers with two floors. AlphaFold3 got confused and matched the top floor of one tower to the bottom floor of the other. Ankh correctly matched top-to-top and bottom-to-bottom.
The Long and Short: One protein was a long rope with a knot, and the other was a short string with a similar knot. AlphaFold3 tried to match the knot to the middle of the long rope. Ankh correctly matched the knot to the end of the long rope, realizing the rest of the rope was just extra fluff.
The Double Trouble: Two proteins had two matching parts. AlphaFold3 matched the first part perfectly but then completely lost the second part, treating it as if it didn't exist. Ankh matched both parts perfectly.

The "Experimental" Twist

The paper also tested a wild hypothesis: What if we used protein structures measured experimentally in a lab instead of computer predictions? You'd think real life would beat the AI. Surprisingly, in a small test, the computer-predicted structures (AlphaFold3) actually produced slightly better alignments than the lab-determined structures.

Why? The authors aren't sure yet, but they suggest that maybe the computer models are "smoothing out" the noise, while real lab data is a bit messy. This is a mystery they want to solve next.

The Bottom Line

If you want to line up protein sequences today, don't just look at the 3D shape. You need to understand the "language" of the protein.

AlphaFold3 is amazing at building the house.
Ankh is amazing at understanding the family living inside.

For the specific job of lining up the letters (sequence alignment), understanding the family (Ankh) is currently more powerful than just looking at the house (AlphaFold3).

The authors have released their "Linguist" tool as a free web server with open-source code, so any scientist can use it to line up proteins.

1. Problem Statement

Protein sequence alignment is a foundational task in bioinformatics, critical for downstream applications such as evolutionary analysis, function prediction, and drug discovery. While traditional methods rely on substitution matrices (e.g., BLOSUM) and dynamic programming, recent advancements have introduced two revolutionary approaches:

Structure-based alignment: Using high-accuracy predicted structures from AlphaFold3 to induce sequence alignments (via structural superposition).
Embedding-based alignment: Using Protein Language Models (PLMs) like Ankh, ProtT5, or ESM-C to generate high-dimensional vector embeddings for amino acids, where alignment scoring is based on the cosine similarity of these vectors.

The paper addresses the question: Which of these modern approaches (AlphaFold3 structural alignment vs. PLM-based scoring) yields better sequence alignments, both relative to traditional methods and relative to each other?

2. Methodology

2.1. Methods Compared

The authors conducted a rigorous comparison of three primary alignment strategies:

Traditional (BLOSUM): Dynamic programming with affine gap penalties using BLOSUM matrices (specifically BLOSUM45, which performed best among BLOSUM variants).
AlphaFold3 + US-align (AF3US):
1. Predict structures for protein sequences using AlphaFold3.
2. Align the predicted structures using US-align (identified as the superior structural aligner compared to DALI and Foldseek).
3. Derive the sequence alignment from the residue pairs aligned in the structural superposition.
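Step 3 of the AF3US pipeline can be illustrated with a minimal sketch. It assumes the structural aligner's output has already been parsed into two gapped rows of equal length (one per protein, with `-` marking a gap); the function name `aligned_pairs` and the example rows are hypothetical and not taken from the paper.

```python
def aligned_pairs(row_a: str, row_b: str, gap: str = "-") -> list[tuple[int, int]]:
    """Return 0-based residue index pairs (i, j) for every alignment column
    in which both rows contain a residue rather than a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = []
    i = j = 0  # running residue indices in the ungapped sequences
    for ca, cb in zip(row_a, row_b):
        if ca != gap and cb != gap:
            pairs.append((i, j))
        if ca != gap:
            i += 1
        if cb != gap:
            j += 1
    return pairs


# Hypothetical gapped rows, as a structural superposition might induce them.
row_a = "MKT-AYIAK"
row_b = "MRTLA--AK"
print(aligned_pairs(row_a, row_b))
# [(0, 0), (1, 1), (2, 2), (3, 4), (6, 5), (7, 6)]
```

The resulting index pairs are what gets compared against the reference alignment; residues that fall opposite a gap simply produce no pair.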
Ankh-score:
1. Generate embeddings for amino acid residues using the Ankh protein language model.
2. Calculate the cosine similarity between the embedding vectors $v_1, v_2$ of two residues $a_1, a_2$ to define the substitution score:
  $\text{Ankh-score}(a_1, a_2) = \frac{v_1 \cdot v_2}{\|v_1\|\|v_2\|}$
3. Perform dynamic programming with affine gap penalties using these similarity scores.
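The three Ankh-score steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-residue embeddings are toy 2-D vectors standing in for real Ankh outputs, the gap penalties (`gap_open`, `gap_extend`) are arbitrary, and the dynamic programming is a standard global affine-gap (Gotoh-style) recurrence.

```python
import math

NEG = float("-inf")


def cosine(u, v):
    """Cosine similarity of two vectors; this plays the role of the substitution score."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def align(emb_a, emb_b, gap_open=1.0, gap_extend=0.1):
    """Global affine-gap alignment scored by cosine similarity of embeddings.
    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Returns (score, pairs); pairs holds (i, j) with None marking a gap."""
    n, m = len(emb_a), len(emb_b)
    # M: residue i aligned to residue j; X: residue i opposite a gap;
    # Y: residue j opposite a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = cosine(emb_a[i - 1], emb_b[j - 1])
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open, X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open, Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    # Traceback from the best-scoring final state.
    final = {"M": M[n][m], "X": X[n][m], "Y": Y[n][m]}
    state = max(final, key=final.get)
    score = final[state]
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        if state == "M":
            pairs.append((i - 1, j - 1))
            prev = {"M": M[i - 1][j - 1], "X": X[i - 1][j - 1], "Y": Y[i - 1][j - 1]}
            i, j = i - 1, j - 1
        elif state == "X":
            pairs.append((i - 1, None))
            prev = {"M": M[i - 1][j] - gap_open, "X": X[i - 1][j] - gap_extend,
                    "Y": Y[i - 1][j] - gap_open}
            i -= 1
        else:
            pairs.append((None, j - 1))
            prev = {"M": M[i][j - 1] - gap_open, "Y": Y[i][j - 1] - gap_extend,
                    "X": X[i][j - 1] - gap_open}
            j -= 1
        state = max(prev, key=prev.get)
    return score, pairs[::-1]


# Toy embeddings (hypothetical): three residues versus two residues.
emb_a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
emb_b = [(1.0, 0.0), (1.0, 1.0)]
score, pairs = align(emb_a, emb_b)
print(pairs)  # [(0, 0), (1, None), (2, 1)]
```

Because each residue's embedding depends on its sequence context, the same amino acid can receive different scores at different positions, which is the main difference from a fixed 20×20 matrix such as BLOSUM.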

2.2. Datasets and Evaluation

Datasets: A diverse set of protein domains was selected from BAliBASE and the Conserved Domain Database (CDD), covering various sequence identity levels.
Reference: Ground truth alignments were provided by the reference Multiple Sequence Alignments (MSAs) in these databases.
Metrics: Four distance metrics were used to measure the deviation of computed alignments from the reference:
1. $d_{ia}$ (Inter-alignment distance): Area between alignment paths.
2. $d_d$ (Relative displacement): Sum of position differences.
3. $d_{cc}$: Distance to the closest position with the same context.
4. $d_{pos}$: A metric inspired by sum-of-pairs scores that accounts for gap position and sequence context (deemed the most relevant).
Statistical Analysis: Pairwise comparisons were evaluated using the Wilcoxon signed-rank test ($p < 0.01$ considered significant).
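To make the statistical comparison concrete, here is a minimal sketch of a two-sided Wilcoxon signed-rank test on paired distances. It uses the large-sample normal approximation without continuity correction, which is a simplification; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used. The distance values are invented for illustration, and the variable names are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).
    Zero differences are dropped; tied |differences| share average ranks.
    Returns (w_plus, p_value)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        avg_rank = (start + end) / 2 + 1  # mean of 1-based ranks in the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        t = end - start + 1
        tie_term += t ** 3 - t
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, p_value


# Hypothetical per-pair distances to the reference (lower is better).
dist_ankh = [3, 5, 2, 8, 6, 4, 7, 1, 9, 10]
dist_af3us = [d + k for k, d in enumerate(dist_ankh, start=1)]
w_plus, p = wilcoxon_signed_rank(dist_ankh, dist_af3us)
print(w_plus, p < 0.01)  # 0 True
```

The test is paired and rank-based, so it asks whether one method is consistently closer to the reference across the same sequence pairs without assuming the distances are normally distributed.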

2.3. Robustness Checks

Gap Penalty Sensitivity: Tests confirmed that Ankh-score performance is robust across a range of gap opening and extension penalties.
PLM Selection: Ankh was selected as the best PLM after outperforming ProtT5, ProstT5 (structure-finetuned ProtT5), and ESM-C in head-to-head tests.

3. Key Results

3.1. Overall Performance

Ankh-score is the clear winner: It consistently produced the most accurate sequence alignments across all datasets (BAliBASE and CDD) and all distance metrics.
Ranking: The performance hierarchy is Ankh-score > AF3US (AlphaFold3 + US-align) > BLOSUM matrices.
AF3US Behavior:
- AF3US outperforms traditional BLOSUM methods, particularly when structural similarity is high (TM-score > 0.5).
- However, AF3US struggles with low-identity sequences and shows "wavering" performance even at high identity levels in some cases, failing to converge to the accuracy of Ankh-score.
- Even when filtering for only high-quality structural alignments (TM-score > 0.5), Ankh-score remains superior.

3.2. Head-to-Head Statistics

Ankh vs. AF3US: Ankh-score won 78.75% of domains, while AF3US won only 10.63%.
AF3US vs. BLOSUM: AF3US won 59.38% of domains against BLOSUM45.
Other PLMs: While Ankh was the best, other PLMs (ProstT5, ProtT5, ESM-C) also generally outperformed AF3US, suggesting the advantage lies in the embedding methodology rather than just the specific Ankh model.
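The win percentages above do not sum to 100% (78.75% + 10.63% leaves 10.62%), so some domains presumably had no clear winner. The sketch below shows one way such a tally could be computed. It assumes a single summary distance per domain with the lower value winning, which may differ from the paper's exact criterion; the function name and data are hypothetical.

```python
def head_to_head(dist_a, dist_b):
    """Return (% won by A, % won by B, % tied) over paired per-domain distances.
    A lower distance to the reference alignment counts as a win."""
    if len(dist_a) != len(dist_b) or not dist_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(dist_a)
    wins_a = sum(a < b for a, b in zip(dist_a, dist_b))
    wins_b = sum(b < a for a, b in zip(dist_a, dist_b))
    ties = n - wins_a - wins_b
    return tuple(round(100 * count / n, 2) for count in (wins_a, wins_b, ties))


# Hypothetical distances for four domains.
print(head_to_head([1, 2, 3, 4], [2, 3, 3, 1]))  # (50.0, 25.0, 25.0)
```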

3.3. Case Studies

The authors provided three specific examples where Ankh-score matched the reference alignment perfectly, while AF3US failed:

MTSS1 vs. Spire: AF3US aligned the wrong WH2 domain (first vs. second), whereas Ankh and the reference aligned the correct homologous domains.
HT16 vs. SH2 SAP: AF3US aligned the short SH2 domain to the wrong, much longer SH2 domain in the target protein. Ankh correctly identified the homologous short domain.
YxjL vs. DegU: AF3US aligned the first domain correctly but completely missed the second domain alignment. Ankh aligned both domains correctly.

3.4. Experimental Structures vs. Predicted Structures

In a limited test using experimentally determined structures (from the cd14765 domain), alignments derived from AlphaFold3-predicted structures (AF3US) surprisingly outperformed alignments derived from the experimental structures (41.82% wins vs. 35.45% wins). While the sample size was small, this suggests that AlphaFold3's predicted structures may contain information or "regularization" that aids alignment better than noisy experimental data in certain contexts, though Ankh-score still dominated both.

4. Key Contributions

Benchmarking: Provided the first comprehensive comparison of AlphaFold3-derived structural alignments against PLM-based sequence alignments and traditional methods.
Methodological Superiority: Demonstrated that Ankh-score (using cosine similarity of embeddings) is currently the state-of-the-art method for protein sequence alignment, outperforming the structural alignment of AlphaFold3 predictions.
Informational Insight: Proposed the hypothesis that Protein Language Models (PLMs) capture evolutionary and functional information embedded in the sequence context that is not fully recoverable from the 3D structures predicted by AlphaFold3.
Tool Availability: Released the Ankh-score alignment software as a free web server and open-source code.

5. Significance and Implications

Paradigm Shift: The results challenge the long-held assumption in bioinformatics that "structure is more conserved than sequence" and that structural alignment is the gold standard. The study suggests that for sequence alignment tasks, contextual embeddings from PLMs provide a richer signal than static 3D structures.
Future Directions: The finding that PLMs possess information absent in AlphaFold3 structures suggests a need to investigate what specific features PLMs capture (e.g., co-evolutionary signals, functional constraints) that structural prediction models might miss or smooth over.
Practical Application: Researchers requiring high-accuracy sequence alignments for homology detection or phylogenetic analysis should prioritize PLM-based scoring (specifically Ankh) over structural alignment pipelines, even when using AlphaFold3.

In conclusion, the paper establishes that Ankh-score is the superior method for protein sequence alignment, indicating that the "language" of proteins contains alignment-critical information that current structural prediction models do not fully leverage.