Horse, not zebra: accounting for lineage abundance in maximum likelihood phylogenetics

This paper introduces two maximum likelihood methods that incorporate lineage abundance priors. One interprets multifurcations as unresolved signals of common strains; the other models sequencing rates as proportional to prevalence. Both aim to improve the accuracy of phylogenetic inference for rapidly evolving pathogens like SARS-CoV-2 by favoring the placement of sequences onto common ("horse") rather than rare ("zebra") lineages.

De Maio, N.

Published 2026-03-27

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Idea: "When You Hear Hoofbeats, Think of Horses"

Imagine you are a detective walking down a street and you hear the sound of hoofbeats. You have two theories:

  1. The Horse: A common horse is galloping by.
  2. The Zebra: A rare, escaped zoo zebra is galloping by.

In the real world, horses are everywhere, and zebras are rare. Even if the sound is slightly ambiguous, the smartest guess is that it's a horse. This is a famous medical rule: "When you hear hoofbeats, think of horses, not zebras." It means that when symptoms are vague, doctors should first consider common diseases before jumping to rare ones.

This paper is about applying that same logic to tracking viruses.

The Problem: The "Blind" Detective

Scientists use computer programs to build "family trees" (phylogenies) for viruses like SARS-CoV-2. These trees show how different virus samples are related.

Usually, these computer programs act like a detective who only looks at the genetic code. They don't care how many people are infected with a specific strain.

  • The Flaw: If a virus sample is an equally good genetic match to a very common strain (the "Horse") and to a very rare strain (the "Zebra"), the computer sees the two placements as equally likely. It cannot choose between them, so it produces a messy, uncertain tree with lots of "maybe this, maybe that" branches.

In the real world, if a virus strain is infecting 1,000 people, and another is infecting only 1 person, and you find a new sample that looks like both, it is statistically much more likely to belong to the group of 1,000. The old computer programs ignored this "abundance" clue.
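
The arithmetic behind that intuition can be sketched in a few lines of Python. This is only an illustration of the general idea, not the paper's method: the function name posterior_by_abundance, the lineage labels, and the assumption that the prior is simply proportional to the number of infected people are all hypothetical choices made for this example.

```python
from fractions import Fraction

def posterior_by_abundance(likelihoods, abundances):
    """Combine genetic-match likelihoods with a prior proportional to abundance.

    Assumption for illustration: prior weight of a lineage = number of
    people infected with it. Returns the normalized posterior per lineage.
    """
    weights = {
        name: Fraction(likelihoods[name]) * abundances[name]
        for name in likelihoods
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# The new sample matches both strains equally well...
likelihoods = {"horse": Fraction(1, 2), "zebra": Fraction(1, 2)}
# ...but one strain infects 1,000 people and the other only 1.
abundances = {"horse": 1000, "zebra": 1}

posterior = posterior_by_abundance(likelihoods, abundances)
print(posterior["horse"])  # 1000/1001
print(posterior["zebra"])  # 1/1001
```

With equal genetic evidence, the abundance counts alone decide the outcome: the common strain gets 1000 out of every 1001 units of probability.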

The Solution: Two New Tricks (HnZ1 and HnZ2)

The author, Nicola De Maio, created two new methods (called HnZ1 and HnZ2) to teach the computer to "think of horses, not zebras."

1. HnZ1: The "Crowded Room" Analogy

Imagine a room full of people.

  • The stranger: Someone new walks in. They look exactly like the 500 people wearing red shirts.
  • The catch: They also look exactly like the one person wearing a green shirt.

Without HnZ1, the computer thinks: "Well, they could be a new red-shirt person OR a new green-shirt person. It's a 50/50 toss-up."

With HnZ1: The computer realizes that because there are 500 red shirts, there are 500 different ways this new person could fit into that group. Because there is only 1 green shirt, there is only 1 way they could fit there.

  • The Math: The computer multiplies the "likelihood" of the red-shirt group by 500. Suddenly, the red shirt becomes the obvious choice. It forces the computer to place the new virus sample onto the "crowded" branches of the tree where the virus is already abundant.
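
A minimal sketch of that multiplication, assuming each candidate placement comes with a log-likelihood and a count of identical samples already sitting there. The function best_placement, the node names, and the numbers are hypothetical and are not taken from the paper or from any real phylogenetics tool. Multiplying a likelihood by 500 is the same as adding log(500) to its log-likelihood, which is how the sketch applies the weight.

```python
import math

def best_placement(candidates):
    """Pick the placement with the highest abundance-weighted score.

    candidates: list of (node_name, log_likelihood, n_identical_samples).
    Score = log_likelihood + log(n_identical_samples), i.e. the likelihood
    multiplied by the number of samples already at that node.
    (Illustrative weighting only, not the paper's exact formula.)
    """
    def score(candidate):
        _, log_lik, n_samples = candidate
        return log_lik + math.log(n_samples)

    return max(candidates, key=score)[0]

# Equal genetic fit: the crowded node wins on abundance alone.
tie = [("red_shirt_node", -10.0, 500), ("green_shirt_node", -10.0, 1)]
print(best_placement(tie))  # red_shirt_node

# A much better genetic fit can still override abundance.
strong_evidence = [("red_shirt_node", -10.0, 500), ("green_shirt_node", -2.0, 1)]
print(best_placement(strong_evidence))  # green_shirt_node
```

The second case shows that abundance acts as a weight rather than a veto. Here log(500) is about 6.2, so the rare placement wins only when its log-likelihood is better by more than that.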

2. HnZ2: The "Popularity Contest" Analogy

This method is similar but slightly more aggressive. It assumes that if a virus strain is common, it is not just more likely to be found, but it is also more likely to be the "parent" of new mutations.

  • Think of it like a popular influencer. If a new trend starts, it's more likely to start with the influencer (who has millions of followers) than with a random person who has one follower.
  • HnZ2 gives a massive "bonus score" to placing new samples on branches that are already huge and popular, effectively saying, "If it's popular, it's probably the right place."

Why Does This Matter? (The Results)

The author tested these methods using millions of SARS-CoV-2 genomes. Here is what happened:

  1. Less Confusion: The computer trees became much clearer. Instead of a messy bush with thousands of "maybe" branches, the trees became cleaner, with fewer "zebras" (rare, unlikely placements) and more "horses" (common, likely placements).
  2. Fewer Mistakes: In simulations, the new methods reduced errors by about 40%.
  3. Real-World Impact: When applied to real pandemic data, the uncertainty in the virus's history dropped tenfold.
    • Example: The paper looked at a specific part of the SARS-CoV-2 tree (the Delta lineage). Without the new method, the computer thought the virus was constantly flipping back and forth between two genetic states (a confusing mess). With the new method, it realized the virus was stable, and the "flipping" was just an illusion caused by not accounting for how common the strains were.

The Bottom Line

For a long time, the programs that build virus family trees ignored how common each strain was. They treated a strain found in 1,000 people the same as a strain found in 1 person.

This paper introduces a simple but powerful fix: Tell the computer to bet on the common stuff. By doing so, we get clearer, more accurate maps of how viruses spread and evolve, which helps us fight pandemics better.

It's a bit like realizing that if you find a lost shoe in a city, it's probably a Nike (common) rather than a custom-made, one-of-a-kind shoe (rare), even if they look similar. Once you make that assumption, you can find the owner much faster.
