k-Nearest Common Leaves algorithm for phylogenetic tree completion

This paper introduces the k-Nearest Common Leaves (k-NCL) algorithm, a Python-based method that completes rooted phylogenetic trees with overlapping taxa. By using both branch lengths and topology to preserve evolutionary relationships, it improves clustering performance compared to existing approaches.

Koshkarov, A., Tahiri, N.

Published 2026-04-04

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: The "Family Tree" Problem

Imagine you are trying to build a massive family tree of all living things. You have two different researchers, let's call them Alice and Bob.

  • Alice has a detailed family tree of Amphibians (frogs, salamanders). She knows exactly how they are related and how long ago they split from each other (branch lengths).
  • Bob has a detailed family tree of Birds. He also knows their relationships and timing.

Now, imagine you want to compare Alice's tree and Bob's tree to see if they agree on how life evolved. But there's a problem: They are looking at different groups of animals. Alice's tree has frogs; Bob's tree has eagles. They only share a few common ancestors (like "Reptiles" or "Vertebrates").

In the past, scientists had two bad options to compare these trees:

  1. The "Pruning" Method: They would chop off all the frogs from Alice's tree and all the eagles from Bob's tree, leaving only the few shared animals.
    • The Problem: This throws away a ton of valuable information. It's like comparing two novels by only reading the first sentence of each.
  2. The "Completion" Method: They would try to guess where the missing animals (frogs in Bob's tree, eagles in Alice's tree) should go.
    • The Problem: Old methods for doing this were like guessing based on a silhouette. They looked at the shape of the tree but ignored the distance (time/evolution). It's like trying to fit a puzzle piece in by looking only at the shape of the edge, ignoring the picture on the piece.

The Solution: The "k-NCL" Algorithm

The authors, Koshkarov and Tahiri, invented a new method called k-Nearest Common Leaves (k-NCL). Think of it as a smart GPS for evolutionary history.

Here is how it works, step-by-step:

1. Finding the "Common Ground"

First, the algorithm identifies the animals that appear in both trees (the "Common Leaves"). These are the anchor points, like the shared street corners in two different city maps.
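In code, this first step is just a set intersection. A minimal sketch (the animal names here are illustrative, not from the paper's datasets):

```python
# Represent each tree's taxa as a set of leaf names.
# The "common leaves" are the intersection of the two sets.
tree_a_leaves = {"Frog", "Salamander", "Lizard", "Crocodile"}
tree_b_leaves = {"Eagle", "Sparrow", "Lizard", "Crocodile"}

common_leaves = tree_a_leaves & tree_b_leaves
print(sorted(common_leaves))  # ['Crocodile', 'Lizard']
```

These common leaves are the anchors every later step builds on: distances to them are what lets the algorithm compare the two trees at all.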

2. The "k-Nearest" Strategy

Now, the algorithm needs to insert a missing animal (say, a Frog) into Bob's Bird tree. How does it know where to put it?

  • It doesn't just guess randomly.
  • It looks at the k nearest neighbors (the "k" stands for a number, usually about half the number of shared animals).
  • It asks: "Which of the shared animals are the Frog's closest evolutionary cousins, based on the history both trees record?"

3. The "Speedometer" Adjustment (Branch Lengths)

This is the paper's biggest innovation.

  • Imagine Alice's tree is drawn on a map where 1 inch = 1 million years.
  • Imagine Bob's tree is drawn on a map where 1 inch = 2 million years.
  • If you just paste Alice's tree onto Bob's, the distances will be wrong.
  • k-NCL acts like a speedometer. It calculates a "scaling factor." If the shared animals in Alice's tree are twice as far apart (in time) as in Bob's, the algorithm stretches or shrinks the new branches to match the target tree's "speed."
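The scaling idea above can be sketched with a few lines of Python. This is a simplified illustration, assuming we compare distances between the same pairs of common leaves as measured in each tree; the exact estimator the paper uses may differ:

```python
# Estimate how much a source-tree branch must be stretched (or shrunk)
# to match the target tree's evolutionary "speed".
def scaling_factor(src_dists, tgt_dists):
    """Each argument maps a pair of common leaves to the distance
    between them in that tree. Returns the ratio target/source."""
    pairs = src_dists.keys() & tgt_dists.keys()
    src_total = sum(src_dists[p] for p in pairs)
    tgt_total = sum(tgt_dists[p] for p in pairs)
    return tgt_total / src_total

src = {("Lizard", "Crocodile"): 2.0, ("Lizard", "Turtle"): 4.0}
tgt = {("Lizard", "Crocodile"): 1.0, ("Lizard", "Turtle"): 2.0}
print(scaling_factor(src, tgt))  # 0.5: source branches get halved
```

Because the ratio is computed only over leaves both trees share, it calibrates the two "maps" against the same landmarks before any new branch is drawn.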

4. The "Sweet Spot" Placement

Once the algorithm knows which neighbors to look at and how much to stretch the branches, it calculates the best spot to insert the missing animal.

  • It tries every possible branch in the target tree.
  • It calculates a "discrepancy score" (how much the distances would be off if we put it there).
  • It picks the spot with the lowest score.
  • Analogy: It's like trying to plug a USB cable into a port. You don't just force it in; you wiggle it slightly until it clicks perfectly into place. k-NCL finds that perfect "click."
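The search above can be sketched as a simple scoring loop. This is a hypothetical illustration: the branch names, the absolute-difference score, and the candidate distances are made up for the example, and the paper's actual discrepancy measure may be defined differently:

```python
# Score each candidate branch by how far the leaf-to-new-taxon distances
# it implies would deviate from the expected (rescaled) distances,
# then pick the branch with the lowest discrepancy.
def best_branch(candidates, expected):
    """candidates: {branch_id: {leaf: implied_distance_to_new_taxon}}.
    expected: {leaf: rescaled distance from the source tree}.
    Returns (best branch_id, its discrepancy score)."""
    def discrepancy(implied):
        return sum(abs(implied[leaf] - expected[leaf]) for leaf in expected)
    scores = {b: discrepancy(d) for b, d in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

expected = {"Lizard": 1.0, "Crocodile": 2.0}
candidates = {
    "branch_1": {"Lizard": 1.1, "Crocodile": 2.3},  # small total error
    "branch_2": {"Lizard": 2.5, "Crocodile": 0.5},  # large total error
}
print(best_branch(candidates, expected)[0])  # branch_1 wins
```

Trying every branch is what makes the "wiggle until it clicks" analogy literal: each candidate attachment point gets a score, and the lowest one is the click.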

Why is this better?

The paper tested this new method against the old ways using real data (frogs, birds, mammals, and sharks).

  • Better Clustering: When scientists try to group similar trees together (like sorting books by genre), k-NCL does a much better job. It keeps the "story" of the evolution intact.
  • Preserves History: Unlike the old "pruning" method, it doesn't throw away the unique animals. Unlike the old "completion" methods, it respects the time it took for evolution to happen, not just the shape of the tree.
  • Fast: It's computationally efficient. It can handle large datasets without taking forever to run.

The Takeaway

Think of k-NCL as a universal translator for evolutionary history.

If you have two different maps of the world (one for the Americas, one for Asia) and you want to merge them into one global map, you can't just glue them together; the coastlines won't match. You have to adjust the scale and find the connecting points.

k-NCL does exactly that for the Tree of Life. It takes two different evolutionary stories, finds the common characters, adjusts the timeline so they match, and seamlessly weaves the missing characters into the story without breaking the plot.

In short: It's a smarter, faster, and more accurate way to combine different pieces of the puzzle of life, ensuring we don't lose any pieces or distort the picture while we do it.
