Horse, not zebra: accounting for lineage abundance in maximum likelihood phylogenetics

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Idea: "When You Hear Hoofbeats, Think of Horses"

Imagine you are a detective walking down a street and you hear the sound of hoofbeats. You have two theories:

The Horse: A common horse is galloping by.
The Zebra: A rare, escaped zoo zebra is galloping by.

In the real world, horses are everywhere, and zebras are rare. Even if the sound is slightly ambiguous, the smartest guess is that it's a horse. This is a famous medical rule: "When you hear hoofbeats, think of horses, not zebras." It means that when symptoms are vague, doctors should first consider common diseases before jumping to rare ones.

This paper is about applying that same logic to tracking viruses.

The Problem: The "Blind" Detective

Scientists use computer programs to build "family trees" (phylogenies) for viruses like SARS-CoV-2. These trees show how different virus samples are related.

Usually, these computer programs act like a detective who only looks at the genetic code. They don't care how many people are infected with a specific strain.

The Flaw: If a virus sample matches a very common strain (the "Horse") and a very rare strain (the "Zebra") equally well, say 99% each, the computer sees the two as equally likely. It cannot choose between them, so it produces a messy, uncertain tree with lots of "maybe this, maybe that" branches.

In the real world, if a virus strain is infecting 1,000 people, and another is infecting only 1 person, and you find a new sample that looks like both, it is statistically much more likely to belong to the group of 1,000. The old computer programs ignored this "abundance" clue.

The Solution: Two New Tricks (HnZ1 and HnZ2)

The author, Nicola De Maio, created two new methods (called HnZ1 and HnZ2) to teach the computer to "think of horses, not zebras."

1. HnZ1: The "Crowded Room" Analogy

Imagine a room full of people.

Match A: A stranger enters who looks exactly like the 500 people wearing red shirts.
Match B: The same stranger also looks exactly like the one person wearing a green shirt.

Without HnZ1, the computer thinks: "Well, they could be a new red-shirt person OR a new green-shirt person. It's a 50/50 toss-up."

With HnZ1: The computer realizes that because there are 500 red shirts, there are 500 different ways this new person could fit into that group. Because there is only 1 green shirt, there is only 1 way they could fit there.

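The Math: The computer multiplies the "likelihood" of joining the red-shirt group by a bonus that grows in proportion to the group's size (in the hundreds here), while the green-shirt option gets almost no bonus. Suddenly, the red shirt becomes the obvious choice. This steers the computer toward placing the new virus sample on the "crowded" branches of the tree, where the virus is already abundant.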

2. HnZ2: The "Popularity Contest" Analogy

This method is similar but slightly more aggressive. It assumes that if a virus strain is common, it is not just more likely to be found, but it is also more likely to be the "parent" of new mutations.

Think of it like a popular influencer. If a new trend starts, it's more likely to start with the influencer (who has millions of followers) than with a random person who has one follower.
HnZ2 gives a massive "bonus score" to placing new samples on branches that are already huge and popular, effectively saying, "If it's popular, it's probably the right place."

Why Does This Matter? (The Results)

The author tested these methods on simulated data and on a real dataset of about two million SARS-CoV-2 genomes. Here is what happened:

Less Confusion: The computer trees became much clearer. Instead of a messy bush with thousands of "maybe" branches, the trees became cleaner, with fewer "zebras" (rare, unlikely placements) and more "horses" (common, likely placements).
Fewer Mistakes: In simulations, HnZ1 prevented about 40% of the errors made when building the tree.
Real-World Impact: When applied to real pandemic data, the share of poorly supported mutation placements dropped from about 7% to about 1%.
- Example: The paper looked at one branch of the family tree (the AY.4 sub-lineage of Delta). Without the new method, the computer thought the virus was constantly flipping back and forth between two genetic states (a confusing mess). With the new method, most of that "flipping" disappeared: it was largely an illusion caused by not accounting for how common the strains were.

The Bottom Line

For a long time, virus trackers ignored the fact that common things are more common. They treated a virus found in 1,000 people the same as a virus found in 1 person.

This paper introduces a simple but powerful fix: Tell the computer to bet on the common stuff. By doing so, we get clearer, more accurate maps of how viruses spread and evolve, which helps us fight pandemics better.

It's a bit like realizing that if you find a lost shoe in a city, it's probably a Nike (common) rather than a custom-made, one-of-a-kind shoe (rare), even if they look similar. Once you make that assumption, you can find the owner much faster.

1. Problem Statement

Maximum likelihood (ML) phylogenetic methods are standard for reconstructing evolutionary histories but traditionally operate under the assumption that sampling is random or lineage-agnostic. They do not incorporate prior hypotheses regarding tree shape or the relative abundance of lineages.

The Gap: In genomic epidemiology (e.g., SARS-CoV-2), sampling is often prevalence-driven. The number of sequenced genomes for a specific strain often reflects its actual abundance in the host population. However, standard ML methods treat all sequences equally, ignoring this abundance data.
The Consequence: In scenarios with high sampling density and low evolutionary divergence (common in pandemics), many genomes are identical or nearly identical. This leads to multifurcations (polytomies) in the inferred tree. When placing an incomplete or ambiguous sequence, standard ML cannot distinguish between placing it on a rare lineage ("zebra") or a common lineage ("horse") if the likelihood scores are identical. This results in high phylogenetic uncertainty and topological errors.

2. Methodology

The author proposes two new approaches, collectively termed "HnZ" (Horse not Zebra), to incorporate lineage abundance into ML phylogenetics. Each method multiplies the phylogenetic likelihood by an extra factor (similar to a tree prior in Bayesian inference) that favors placements on abundant lineages.

Approach 1: HnZ1 (Rescaling by Topological Resolutions)

Concept: Interprets a mutational multifurcation (MM) not as an instantaneous event, but as a lack of signal resolving a set of possible bifurcating topologies.
Mechanism:
- A node of size $n$ (number of descendant branches) represents a multifurcation.
- The number of possible rooted bifurcating resolutions for a node of size $n$ is given by the double factorial: $H(n) = (2n - 3)!!$.
- The HnZ1 score for a tree is the product of $H(n)$ over all nodes.
- Effect: Adding a sample to a node of size $n$ multiplies the score by $H(n+1)/H(n) = 2n - 1$, so placing it on a larger multifurcation raises the total score by a larger factor than placing it on a smaller one. This mathematically favors the "horse" (common strain) over the "zebra" (rare strain).
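The HnZ1 mechanism above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MAPLE's code: the function names (`log_hnz1_node`, `log_hnz1_tree`, `log_gain`, `pick_placement`) and the toy log-likelihood values are hypothetical, and the sketch works in log space to avoid overflow for large nodes.

```python
import math


def log_hnz1_node(n: int) -> float:
    """Return log H(n) = log((2n - 3)!!) for a node with n >= 1 descendant branches."""
    if n < 1:
        raise ValueError("node size must be at least 1")
    # (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3); the empty product (n <= 2) is 1.
    return sum(math.log(k) for k in range(3, 2 * n - 2, 2))


def log_hnz1_tree(node_sizes: list[int]) -> float:
    """Log of the HnZ1 tree score: the product of H(n) over all nodes."""
    return sum(log_hnz1_node(n) for n in node_sizes)


def log_gain(n: int) -> float:
    """Log of H(n + 1) / H(n): the bonus for adding one branch to a node of size n."""
    return log_hnz1_node(n + 1) - log_hnz1_node(n)


def pick_placement(candidates: dict[str, tuple[float, int]]) -> str:
    """Pick the candidate with the best log-likelihood plus HnZ1 bonus.

    `candidates` maps a label to (log-likelihood, current node size).
    """
    return max(
        candidates,
        key=lambda name: candidates[name][0] + log_gain(candidates[name][1]),
    )


# Toy values (assumed): both placements have the same log-likelihood,
# so only the node sizes can break the tie.
candidates = {"horse": (-1234.5, 500), "zebra": (-1234.5, 2)}
print(pick_placement(candidates))      # horse
print(round(math.exp(log_gain(500))))  # 999, i.e. 2 * 500 - 1
```

With equal likelihoods, the node of size 500 receives a bonus factor of 999 against a factor of 3 for the node of size 2, so the abundant lineage wins the tie.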

Approach 2: HnZ2 (Tree Prior based on Abundance)

Concept: Directly models the probability of sampling a genome proportional to its abundance.
Mechanism:
- Defines the abundance of a genome at node $i$ as $f_i = n_i / N$, where $n_i$ is the node size and $N$ is the total number of genomes.
- The prior probability is defined as the product of abundances raised to the power of node size: $\prod_i (n_i/N)^{n_i}$.
- Dropping the factor $N^{-\sum_i n_i}$, which is treated as a constant, leaves a per-node score of $H(n) = n^n$.
- Effect: Similar to HnZ1, this strongly favors placing samples on large multifurcations. It is described as slightly more "aggressive" than HnZ1 in penalizing rare placements.
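A short Python sketch can make the HnZ2 simplification and the comparison with HnZ1 concrete. The function names, the toy node sizes, and the choice of taking $N$ as the sum of those sizes are assumptions made for illustration only. The HnZ1 gain of $2n - 1$ follows from $H(n) = (2n - 3)!!$.

```python
import math


def log_hnz2_prior(node_sizes: list[int], total: int) -> float:
    """Log of the full prior: prod_i (n_i / N)^(n_i)."""
    return sum(n * math.log(n / total) for n in node_sizes)


def log_hnz2_node(n: int) -> float:
    """Log of the simplified per-node score H(n) = n^n."""
    return n * math.log(n)


def hnz2_gain(n: int) -> float:
    """H(n + 1) / H(n) for HnZ2, i.e. (n + 1)^(n + 1) / n^n."""
    return math.exp(log_hnz2_node(n + 1) - log_hnz2_node(n))


def hnz1_gain(n: int) -> float:
    """H(n + 1) / H(n) for HnZ1, which equals 2n - 1 when H(n) = (2n - 3)!!."""
    return float(2 * n - 1)


sizes = [500, 3, 2]  # hypothetical node sizes
N = sum(sizes)       # assumption for this toy example

# The full prior and the simplified score differ only by the term in N.
full = log_hnz2_prior(sizes, N)
simplified = sum(log_hnz2_node(n) for n in sizes) - sum(sizes) * math.log(N)
print(math.isclose(full, simplified))  # True

# Bonus for adding one sample to a node of size n under each scheme.
for n in (2, 10, 500):
    print(n, hnz1_gain(n), round(hnz2_gain(n), 1))
```

For large $n$ the HnZ2 gain approaches $e(n + 1)$, compared with $2n - 1$ for HnZ1, which is one way to read the description of HnZ2 as slightly more aggressive.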

Implementation

Both methods are implemented in the open-source software MAPLE (v0.7.5.4).
They integrate with Subtree Prune and Regraft (SPR) searches, recalculating node sizes and scores dynamically during tree optimization.
To manage computational load, log-scores are stored in look-up tables, and node sizes are only updated when topology changes occur.
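The look-up-table idea can be illustrated with a small sketch. This is not MAPLE's implementation: the class `LogScoreTable`, the helper `regraft_log_delta`, and the assumption that a move changes only two node sizes (the source node shrinks by one branch, the target node grows by one) are hypothetical simplifications. It uses the HnZ1 score and the recurrence $H(k) = H(k-1)\,(2k - 3)$.

```python
import math


class LogScoreTable:
    """Cache of log H(n) for HnZ1, extended on demand."""

    def __init__(self) -> None:
        # Index n holds log H(n); index 0 is a placeholder and log H(1) = 0.
        self._log_h = [0.0, 0.0]

    def log_h(self, n: int) -> float:
        if n < 1:
            raise ValueError("node size must be at least 1")
        while len(self._log_h) <= n:
            k = len(self._log_h)  # next node size to fill in
            # H(k) = H(k - 1) * (2k - 3), so each new entry costs one logarithm.
            self._log_h.append(self._log_h[-1] + math.log(2 * k - 3))
        return self._log_h[n]


def regraft_log_delta(table: LogScoreTable, source_size: int, target_size: int) -> float:
    """Change in the log HnZ1 score when one branch moves between two nodes.

    Illustrative assumption: only the source and target node sizes change.
    """
    if source_size < 2:
        raise ValueError("source node must keep at least one branch")
    lost = table.log_h(source_size - 1) - table.log_h(source_size)
    gained = table.log_h(target_size + 1) - table.log_h(target_size)
    return lost + gained


table = LogScoreTable()
# Moving a branch from a node of size 3 to a node of size 500 raises the score.
print(regraft_log_delta(table, 3, 500) > 0)  # True
```

Because only the affected node sizes are looked up, the score change for a candidate move is a handful of table reads rather than a recomputation over the whole tree.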

3. Key Contributions

Theoretical Framework: Introduces a novel interpretation of multifurcations in ML phylogenetics, treating them as sets of unresolved bifurcating trees rather than single topological events.
Algorithmic Innovation: Develops two distinct scoring functions (HnZ1 and HnZ2) that allow standard ML algorithms to utilize lineage abundance data without switching to computationally expensive Bayesian MCMC methods.
Software Integration: Successfully integrates these methods into MAPLE, making them accessible for large-scale genomic epidemiology.
Empirical Validation: Demonstrates that incorporating abundance data significantly reduces phylogenetic uncertainty and improves topological accuracy in both simulated and real-world pandemic data.

4. Results

Simulation Benchmarks

Accuracy: Both HnZ1 and HnZ2 significantly improved phylogenetic accuracy compared to standard ML. HnZ1 prevented approximately 40% of topological inference errors.
Computational Cost: The methods increased inference time by roughly 2x (due to the need to keep more genomes in the analysis and longer SPR searches) but had a negligible impact on memory usage.

Real-World Application (SARS-CoV-2)

The methods were applied to a global dataset of 2,072,111 SARS-CoV-2 genomes.

Reduction in Uncertainty:
- Without HnZ, ~6.91% of substitutions had low support (<50%).
- With HnZ1, this dropped to ~1.04%.
- For terminal branches (often the most uncertain), low support dropped from 8.39% to 0.11%.
Case Study: AY.4 Delta Sub-lineage:
- Previous analyses showed complex, uncertain evolutionary histories with frequent reversions (e.g., T17040C $\to$ C17040T $\to$ T17040C).
- With HnZ1, the inferred history became much simpler: the number of inferred reversions dropped drastically (e.g., from 655 C17040T reversions to 40).
- Major sub-clades achieved 100% support with HnZ1, compared to <10% without it.
Mechanism of Improvement: The methods correctly identified that mutations occurring in high-prevalence genomic backgrounds are more probable than those occurring in rare backgrounds, resolving ambiguities that standard ML could not distinguish.
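As a quick arithmetic check on the figures reported above, the fold reductions can be computed directly. The helper name `fold_reduction` and the dictionary labels are illustrative; the numbers are the ones quoted in this section.

```python
def fold_reduction(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


# Values quoted in the results above (without HnZ, with HnZ1).
reported = {
    "low-support substitutions (%)": (6.91, 1.04),
    "low-support substitutions on terminal branches (%)": (8.39, 0.11),
    "inferred C17040T reversions": (655, 40),
}

for label, (before, after) in reported.items():
    print(f"{label}: {fold_reduction(before, after):.1f}-fold lower")
```

The three reductions come out at about 6.6-fold, 76.3-fold and 16.4-fold, so the size of the improvement depends on which measure of uncertainty is used.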

5. Significance

"Horse not Zebra" Principle: The paper successfully translates a fundamental medical diagnostic heuristic into a rigorous phylogenetic algorithm. It prioritizes common evolutionary histories over rare ones when evidence is ambiguous.
Scalability vs. Accuracy: It offers a "middle ground" between standard ML (fast but ignores abundance) and full Bayesian inference (which can use tree priors but is computationally prohibitive for millions of sequences). HnZ methods add a prior-like abundance term while keeping ML-level scalability, at roughly twice the inference time in the simulation benchmarks.
Impact on Genomic Epidemiology: By reducing phylogenetic uncertainty by an order of magnitude, these methods improve downstream analyses, including:
- Lineage assignment (e.g., Pango lineage calling).
- Phylodynamics and transmission history reconstruction.
- Identification of mutation rates and fitness effects.
Broader Applicability: While tested on SARS-CoV-2, the approach is applicable to any scenario with dense sampling where sequence abundance reflects population prevalence, such as metagenomics, single-cell genomics, and cancer genomics.

In conclusion, De Maio demonstrates that explicitly modeling lineage abundance in maximum likelihood frameworks resolves critical ambiguities in pandemic-scale phylogenetics, leading to more accurate and confident evolutionary reconstructions.