On the consistency of duplication, loss, and deep… — Plain-Language Explanation

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: Reconstructing a Family Tree from Confusing Clues

Imagine you are a detective trying to reconstruct the family tree of a group of animals (the Species Tree). However, you don't have a single, perfect diary of their history. Instead, you have hundreds of different diaries written by different family members (the Gene Trees).

Here's the problem: These diaries often contradict each other.

Sometimes, a family member had a twin (a Gene Duplication).
Sometimes, a family member died without leaving a descendant (a Gene Loss).
Sometimes, a family member had a child before the family officially split into two distinct branches, causing the child's lineage to "jump" across the family tree (a Deep Coalescence or Incomplete Lineage Sorting).

Because of these messy events, the story told by the "eye color" gene might look different from the story told by the "blood type" gene. Your job is to figure out the true family tree that explains all these conflicting stories.

The Old Strategy: "The Cheapest Explanation Wins"

For a long time, scientists used a method called Gene Tree Parsimony (GTP). Think of this as a detective who believes in Occam's Razor: The simplest explanation is usually the right one.

The detective looks at all the conflicting gene diaries and asks: "Which family tree requires the fewest number of weird events (duplications, deaths, or jumps) to make sense of all the data?"

If Tree A requires 5 weird events to explain the data, and Tree B requires 10, the detective picks Tree A.

This method is popular because it's fast and easy to understand. However, this paper asks a scary question: Is this detective actually good at their job, or are they just guessing?

The Discovery: The Detective is Biased

The authors of this paper (Nicolae Sapoval and Luay Nakhleh) used mathematical proofs to check whether this "cheapest explanation" method finds the true family tree as the amount of data grows without limit.

The Verdict: The method is not reliable. No matter how you mix the rules (counting only duplications, only deaths, or a mix of both), there are situations in which the detective settles on the wrong answer, even with unlimited data.

They found two specific "traps" (called Anomaly Zones) where the method fails:

The Symmetric Trap: If the true family tree is perfectly balanced (like a fork with two equal prongs), the "Duplication" detective gets confused. It concludes that the tree is lopsided because that appears to be the cheaper way to explain the data.
The Asymmetric Trap: If the true family tree is lopsided (like a ladder), the "Deep Coalescence" detective gets confused. It starts thinking the tree is perfectly balanced.

The Analogy: Imagine you are trying to guess the shape of a hidden object by feeling its shadow.

If the object is a perfect sphere, your shadow-reading tool might tell you it's a cube.
If the object is a cube, your tool might tell you it's a sphere.
The paper proves that no matter how you adjust your tool (by weighting the "cube" clues more or the "sphere" clues more), it will always fail in at least one specific situation.

The "Mix-and-Match" Hope? (And Why It Failed)

You might think, "Okay, maybe if we combine the rules? Let's count duplications and jumps together with a specific formula."

The authors proved mathematically that this doesn't work either.
They showed that any linear combination of these costs (adding them up with different weights) is still statistically inconsistent. It's like trying to fix a broken compass by adding a second broken compass to it; the result is still a broken compass.

The Real-World Test: Does it matter in practice?

Since the math says the method is flawed, the authors ran computer simulations to see how much this matters in practice.

The Setup: They simulated thousands of fake evolutionary histories with different levels of "messiness" (high duplication, high gene loss, or lots of confusing jumps).
The Result:
- When the evolutionary history was very messy (lots of "jumps" or ILS), the old methods failed to converge on the right answer, even with tons of data.
- However, when the "Duplication" cost was given a very high weight (ignoring the confusing jumps), the method performed surprisingly well in many cases.
- Interestingly, a newer method called ASTRAL-Pro (which uses a more complex statistical approach) generally performed better and didn't get stuck in the same traps.

The Takeaway for Everyone

The "Simplest Explanation" isn't always the Truth: In the complex world of evolution, the path with the fewest "weird events" isn't always the correct family tree. The math proves that relying solely on "counting the cheapest events" will lead you astray in specific scenarios.
Don't just tweak the knobs: You can't fix this broken method just by adjusting the weights (e.g., "let's count duplications twice as much"). The fundamental logic is flawed under certain conditions.
Use the right tool: If you are studying groups of species where genes jump around a lot (high ILS), you should probably use modern statistical methods (like ASTRAL-Pro) rather than the old "counting events" method.

In short: The paper pulls the rug out from under a very popular, easy-to-use method for building family trees. It tells us that while the method is fast, it is mathematically unreliable in certain situations, and we need to be careful not to trust it blindly.

1. Problem Statement

Phylogenomic species tree inference is complicated by gene tree discordance, where individual gene trees differ from the species tree due to biological processes such as Gene Duplication and Loss (GDL) and Incomplete Lineage Sorting (ILS).

Context: While statistically consistent methods exist (e.g., those based on the Multispecies Coalescent, MSC), Gene Tree Parsimony (GTP) methods remain popular in practice due to their computational efficiency and interpretability. GTP methods infer a species tree by minimizing a reconciliation cost (duplications, losses, or deep coalescences) between gene trees and the species tree.
The Issue: Previous work established that GTP estimators using individual costs (e.g., pure Deep Coalescence or pure Duplication) are statistically inconsistent under the MSC model in specific "anomaly zones" (regions of branch length parameters where the most probable gene tree topology differs from the species tree topology).
The Gap: It was unknown whether linear combinations of these costs (e.g., minimizing a weighted sum of duplications, losses, and deep coalescences) could overcome these inconsistencies. The authors aim to determine if any weighted combination of these costs yields a statistically consistent estimator under the MSC.

2. Methodology

The authors employ a rigorous combination of theoretical proofs and extensive empirical simulations.

A. Theoretical Framework

Definitions: They define the GTP estimator $\hat{S}$ as the species tree minimizing the sum of costs over $m$ gene trees. The estimator is consistent if the probability that $\hat{S}$ equals the true species tree ($S_{GT}$) tends to 1 as $m \to \infty$.
Cost Functions: They analyze a generalized cost function:
$c_{wDLX}(G, S) = w_D c_D(G, S) + w_L c_L(G, S) + w_X c_X(G, S)$
Using a known relationship between loss, duplication, and deep coalescence costs ( $c_L = c_X + 2c_D$ ), they reduce the problem to analyzing linear combinations of Duplication ( $c_D$ ) and Deep Coalescence ( $c_X$ ) costs:
$c(G, S) = \alpha c_D(G, S) + \beta c_X(G, S)$
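The reduction can be checked by direct substitution, assuming the relationship holds exactly as stated above. Replacing $c_L$ with $c_X + 2c_D$ in the generalized cost gives
$c_{wDLX}(G, S) = w_D c_D + w_L (c_X + 2 c_D) + w_X c_X = (w_D + 2 w_L)\, c_D + (w_L + w_X)\, c_X$
so that $\alpha = w_D + 2 w_L$ and $\beta = w_L + w_X$. Non-negative weights $w_D, w_L, w_X$ therefore always produce non-negative $\alpha$ and $\beta$.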
Anomaly Zone Analysis: They utilize the concept of the "anomaly zone," where the expected cost of an incorrect tree topology is lower than that of the true tree.
- They prove that for symmetric species trees, duplication costs are inconsistent.
- They prove that for asymmetric species trees, deep coalescence costs are inconsistent.
Proof Strategy: The authors construct specific 4-taxon species tree topologies (symmetric and asymmetric) and demonstrate that for any non-negative weights $\alpha$ and $\beta$ , there exist branch length parameters (an anomaly zone) where the estimator converges to an incorrect topology. They extend these 4-taxon results to $N \geq 4$ taxa using embedding techniques.
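To make the quantities above concrete, here is a minimal Python sketch of the two costs and of the estimator as an argmin over candidate species trees. It is an illustration under stated assumptions, not the authors' implementation: trees are nested tuples of species names, gene trees are single-copy and cover every species, duplications are counted with the standard LCA mapping, and deep coalescence is counted as extra lineages per species branch. The function names (`costs`, `gtp_estimate`) are hypothetical.

```python
from itertools import chain


def leaves(tree):
    """Leaf labels under a nested-tuple tree; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaves(child) for child in tree))


def cluster_depths(species_tree):
    """Map every species-tree node, identified by its leaf set, to its depth."""
    depths = {}

    def walk(node, depth):
        depths[leaves(node)] = depth
        if not isinstance(node, str):
            for child in node:
                walk(child, depth + 1)

    walk(species_tree, 0)
    return depths


def lca(depths, leafset):
    """Deepest species cluster containing leafset (the LCA mapping)."""
    return max((c for c in depths if leafset <= c), key=depths.get)


def costs(gene_tree, species_tree):
    """Return (duplications, deep coalescences) for one single-copy gene tree."""
    depths = cluster_depths(species_tree)
    counts = {"dups": 0, "crossings": 0}

    def visit(node):
        image = lca(depths, leaves(node))
        if not isinstance(node, str):
            child_images = [visit(child) for child in node]
            # A gene node mapped to the same species node as a child is a duplication.
            if image in child_images:
                counts["dups"] += 1
            # Each gene edge crosses one species branch per level it climbs.
            for child_image in child_images:
                counts["crossings"] += depths[child_image] - depths[image]
        return image

    visit(gene_tree)
    # Extra lineages: every species branch carries at least one lineage
    # (assumption: single-copy gene tree covering all species).
    species_branches = len(depths) - 1
    return counts["dups"], counts["crossings"] - species_branches


def gtp_estimate(gene_trees, candidates, alpha, beta):
    """Candidate species tree minimizing the summed alpha*c_D + beta*c_X."""
    def total(species_tree):
        pairs = (costs(g, species_tree) for g in gene_trees)
        return sum(alpha * d + beta * x for d, x in pairs)
    return min(candidates, key=total)


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
genes = [balanced, balanced, caterpillar]
best = gtp_estimate(genes, [balanced, caterpillar], alpha=1.0, beta=1.0)
print(costs(caterpillar, balanced), best)
```

A real analysis would search over all candidate topologies and handle multi-copy gene families, for which the deep coalescence count needs a more careful definition than the one used here.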

B. Empirical Evaluation

Simulation Setup: They used SimPhy to simulate species trees and gene trees under the MSC with GDL.
- Scenarios: Four scenarios varying ILS levels (effective population size) and duplication/loss rates.
- Data: Simulated DNA sequences (using INDELible), inferred gene trees (using IQ-TREE), and then inferred species trees using GTP.
Comparison: They compared various GTP weighting schemes (varying ratios of $\alpha$ to $\beta$) against:
- Pure duplication cost.
- Pure deep coalescence cost.
- ASTRAL-Pro 3 (a statistically consistent method for handling paralogs under MSC).
Metrics: Topological error measured via normalized Robinson-Foulds (RF) distance.
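As a rough illustration of the metric, the sketch below computes a normalized Robinson-Foulds distance between two rooted trees by comparing their non-trivial clusters. The rooted, cluster-based variant and the normalization by the total number of clusters are assumptions made here for simplicity; the preprint may use an unrooted, bipartition-based version. Trees are nested tuples of leaf names, and the function names are hypothetical.

```python
def clusters(tree):
    """Leaf sets of internal nodes of a rooted nested-tuple tree, root excluded."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    root = walk(tree)
    found.discard(root)  # the root cluster is shared by all trees on the same taxa
    return found


def normalized_rf(tree_a, tree_b):
    """Clusters found in exactly one tree, divided by the total cluster count."""
    ca, cb = clusters(tree_a), clusters(tree_b)
    denominator = len(ca) + len(cb)
    return len(ca ^ cb) / denominator if denominator else 0.0


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
print(normalized_rf(balanced, balanced), normalized_rf(balanced, caterpillar))
```

A value of 0 means the two trees share every cluster, and a value of 1 means they share none.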

3. Key Contributions

Theoretical Inconsistency Proof: The primary contribution is Theorem 1, which proves that no linear combination of gene duplication, loss, and deep coalescence costs yields a statistically consistent estimator under the MSC model for species trees with $N \geq 4$ taxa.
- This implies that even if one attempts to "balance" the errors of duplication and deep coalescence by weighting them, the estimator will still fail in specific anomaly zones.
Complementary Bias Analysis: The paper elucidates the complementary nature of the inconsistencies:
- Duplication costs tend to be inconsistent on symmetric topologies (preferring asymmetric ones in the anomaly zone).
- Deep coalescence costs tend to be inconsistent on asymmetric topologies (preferring symmetric ones).
- Any linear combination inherits this vulnerability; if the weight on deep coalescence is non-zero, the estimator remains susceptible to the asymmetric anomaly zone.
Empirical Validation: The simulations agree with the theoretical results, showing that GTP methods do not consistently converge to the true tree as the number of gene trees increases, unlike consistent methods (ASTRAL-Pro).

4. Results

Theoretical: For any weights $\alpha, \beta \geq 0$ (where at least one is non-zero), there exists a set of branch lengths where the expected cost of an incorrect tree is lower than that of the true tree.
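The expected-cost comparison behind this statement can be illustrated on the much simpler three-taxon case, where the MSC gene tree probabilities have a well-known closed form: with internal branch length $t$ in coalescent units, the concordant topology has probability $1 - \tfrac{2}{3}e^{-t}$ and each discordant topology has probability $\tfrac{1}{3}e^{-t}$. The sketch below is only meant to show what is being compared. With three taxa the true tree always has the lowest expected deep coalescence cost, and the four-taxon anomaly-zone calculations from the preprint are not reproduced here. The function names are hypothetical.

```python
import math


def gene_tree_probs(t):
    """MSC topology probabilities when the species tree is ((A,B),C)."""
    discordant = math.exp(-t) / 3.0
    return {"AB|C": 1.0 - 2.0 * discordant, "AC|B": discordant, "BC|A": discordant}


def expected_dc_cost(candidate, t):
    """Expected deep coalescence cost of a candidate three-taxon species tree.

    For three taxa the cost is 0 when the gene tree matches the candidate
    and 1 (one extra lineage) otherwise.
    """
    probs = gene_tree_probs(t)
    return sum(p for topology, p in probs.items() if topology != candidate)


for t in (0.01, 0.5, 2.0):
    row = {c: round(expected_dc_cost(c, t), 3) for c in ("AB|C", "AC|B", "BC|A")}
    print(t, row)
```

An estimator is inconsistent at a given set of branch lengths when some incorrect candidate attains a lower expected cost than the true tree, which is the situation the paper constructs for four or more taxa.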
Simulation (Simulated Gene Trees):
- As the number of gene trees increased, ASTRAL-Pro 3 showed decreasing topological error (consistent behavior).
- GTP methods did not show consistent improvement; error rates often plateaued or fluctuated, in line with the proven inconsistency.
- Weighting Impact: Increasing the weight of the duplication cost ( $\alpha$ ) relative to deep coalescence ( $\beta$ ) generally reduced topological error in the simulations. The best-performing GTP scheme was often pure duplication or a high ratio of duplication-to-deep-coalescence (e.g., 32:1).
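The effect of a weighting ratio can be sketched in a few lines of Python. Only the ratio of $\alpha$ to $\beta$ matters, because multiplying both weights by the same positive constant does not change which tree has the lowest cost. The per-tree totals below are hypothetical numbers chosen for illustration and do not come from the preprint; `best_tree` is likewise a hypothetical helper.

```python
def best_tree(totals, alpha, beta):
    """Name of the candidate with the lowest alpha*duplications + beta*deep coalescences."""
    return min(totals, key=lambda name: alpha * totals[name][0] + beta * totals[name][1])


# Hypothetical summed costs over all gene trees: (duplications, deep coalescences).
totals = {"tree_1": (120, 340), "tree_2": (150, 300)}

# Sweep duplication-to-deep-coalescence ratios from 1:32 to 32:1.
for alpha, beta in ((1, 32), (1, 1), (32, 1)):
    print(f"{alpha}:{beta}", best_tree(totals, alpha, beta))
```

In this toy input the selected tree changes once duplications are weighted heavily enough, which is the kind of sensitivity a weighting sweep is meant to expose.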
Simulation (Inferred Gene Trees): When gene trees were inferred from sequence data (introducing gene tree estimation error), the trends held. High ILS levels were the primary driver of error for GTP methods.
Biological Data (Fungi): On a real dataset of 16 fungi, GTP methods (with various weights) and ASTRAL-Pro 3 produced identical topologies that differed by one split from previous literature, suggesting that in this specific biological case, the anomaly zone might not have been triggered, or the data was robust enough to overcome the inconsistency.

5. Significance

Theoretical Limitation: This work definitively closes the door on the hope that simple linear combinations of parsimony scores could fix the statistical inconsistency of GTP under the MSC. It establishes a fundamental limitation of the parsimony approach in the presence of ILS.
Practical Guidance:
- Researchers relying on GTP should be aware that increasing the number of genes does not guarantee convergence to the true tree if the data falls into an anomaly zone.
- Weighting Strategy: If GTP must be used (due to computational constraints), the authors suggest that minimizing the weight of deep coalescence (or using pure duplication cost) yields better empirical performance, as deep coalescence costs are more prone to inconsistency on asymmetric trees, which are common in real data.
Future Directions: The paper highlights the need for methods that are consistent under unified models (DLCoal) and suggests that quartet-based methods may offer a path forward, though sample complexity and rooting errors remain open challenges.

In summary, the paper provides a rigorous mathematical proof that Gene Tree Parsimony is fundamentally inconsistent under the Multispecies Coalescent, regardless of how duplication, loss, and deep coalescence costs are weighted, and validates this through extensive simulations showing that these methods fail to converge to the true species tree in anomaly zones.

On the consistency of duplication, loss, and deep coalescence gene tree parsimony costs under the multispecies coalescent