On the correctness of gene tree tagging under a unified model of gene duplication, loss, and coalescence

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to reconstruct the family history of a massive, ancient clan. You have thousands of old, fragmented letters (genes) written by different members of the family. Your goal is to piece together the true "Family Tree" (the species tree) that shows how the different branches of the clan split apart over time.

Usually, this is hard because individual letters don't always follow the main family tree. Sometimes versions of a letter stay mixed together longer than the family branches that carry them, so the letter's own history disagrees with the clan's (a phenomenon called Incomplete Lineage Sorting), and sometimes a letter is copied, leaving one branch of the family with two versions of it (a Gene Duplication).

For years, scientists had a great tool called ASTRAL to solve this, but it had a strict rule: it could only handle families where everyone had exactly one copy of every letter. If a family had duplicates (twins, triplets, etc.), ASTRAL had to throw those letters away, losing a huge amount of data.

A newer tool, ASTRAL-pro, was invented to handle these messy, multi-copy families. It tries to look at the letters, figure out which ones are "twins" (duplicates) and which are "originals" (speciations), and then use that information to build a better tree. However, there was a big question hanging over it: Is ASTRAL-pro actually looking at the letters correctly?

Here is the simple breakdown of what this paper does:

1. The "Tagging" Problem: Who is the Twin?

Imagine you find a letter in the family archive. You need to decide: "Is this letter a copy of an older letter (a duplication), or is it a brand new letter written when the family split (a speciation)?"

The Old Way: If the family history was simple, you could just look at the date and say, "This is a twin."
The New Problem: But families are messy. Sometimes, two "twins" get mixed up in the mail (deep coalescence), making it look like they are unrelated, or making unrelated letters look like twins.
The Paper's Solution: The authors propose a new, strict rule for tagging. They say: "A branching point in a letter's history is a 'twin' point (duplication) if it is the most recent common ancestor of at least one pair of copies that are related by duplication."

Think of it like a detective rule: If you see two cousins who are definitely related because their parents were twins, then the grandparent who started that twin line is tagged as a "Twin Ancestor." This rule works even when the family history is messy and confusing.

2. The "Filter" Analogy

Once the letters are tagged, ASTRAL-pro uses a special filter.

The "Speciation" Letters: These tell the story of how the family split into different branches. These are Gold.
The "Duplication" Letters: These tell the story of how copies were made within a branch. These are Noise when trying to figure out the main family tree.

ASTRAL-pro tries to throw away the "Noise" (Duplication Letters) and only count the "Gold" (Speciation Letters) to build the tree.

3. The Big Question: Does the Filter Work?

The authors asked: "If we use our new detective rule to tag the letters, does ASTRAL-pro actually find the correct Family Tree?"

The Theory: They tried to prove mathematically that yes, it should work. They found that while the argument goes through for simple families, the "messy mail" (deep coalescence) makes the math incredibly tricky. They couldn't fully prove it yet, but they give partial proofs and pinpoint the scenarios that stand in the way of a complete one.
The Experiment: They built a computer simulation (a fake family history) with thousands of gene trees, including lots of twins and messy mail. They tested ASTRAL-pro against other methods.

4. The Results: It Works!

The experiments showed that:

ASTRAL-pro (and their new tool, TQMC-pro) are much better at finding the true Family Tree than methods that do not separate duplications from speciations (ASTRAL-multi and standard TREE-QMC).
Even when the "tagging" isn't perfect (the detective makes a few mistakes), the final Family Tree is still very accurate. It's like having a few wrong clues in a mystery novel, but still solving the case correctly because the other clues are so strong.
When they tested this on real plant data (the "1KP" plant dataset), ASTRAL-pro and their new tool produced trees that closely matched a benchmark tree built from single-copy genes and recovered the major plant groups. ASTRAL-multi disagreed with that benchmark on most branches and failed to recover those groups.

The Takeaway

This paper is like a quality control check for a new, powerful tool. The authors said, "We have a new way to sort through messy family histories. Here is a strict rule for how to sort them, and here is evidence that even if the sorting isn't 100% perfect, the final result is still very accurate."

They didn't just say "it works"; they gave a clear definition of what correct tagging means and showed in experiments that the approach handles the messy, complex reality of evolution much better than methods that keep duplication signal in the mix.

1. Problem Statement

The reconstruction of species trees from genomic data is complicated by Gene Tree Heterogeneity (GTH), primarily caused by Incomplete Lineage Sorting (ILS) (modeled by the Multi-Species Coalescent, MSC) and Gene Duplication and Loss (GDL).

Current Limitations: The leading method, ASTRAL, is statistically consistent under the MSC but assumes genes evolve without duplication. While ASTRAL-multi (A-multi) extends this to multi-copy genes, it often underperforms in simulations involving both ILS and GDL.
The A-pro Approach: ASTRAL-pro (A-pro) improves accuracy by rooting and tagging gene tree internal vertices as either "duplications" or "speciations." It excludes "duplication quartets" (DQs, quartets driven by duplication events) and aggregates the remaining "speciation quartets" (SQs).
The Core Issue: A-pro's tagging logic is well-defined for GDL-only scenarios but lacks a rigorous definition of "correct tagging" when deep coalescence (ILS) is present. Deep coalescence can cause gene copies to coalesce in ways that obscure the true duplication history, potentially leading to incorrect vertex tagging. Without a formal definition of correctness, the statistical consistency of A-pro under the unified DLCoal (Duplication-Loss-Coalescent) model remains unproven.
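
To make the exclusion step concrete, here is a minimal Python sketch. It assumes a rooted binary gene tree whose internal vertices are already tagged, and it uses the rule described for A-pro that a quartet counts as a speciation quartet only if its four copies come from four different species and the LCA of every three of them is tagged as a speciation. That rule is not spelled out in this summary, so treat it as an assumption; the tuple encoding and function names are hypothetical, and the sketch does not aggregate equivalent speciation quartets.

```python
from itertools import combinations

# Hypothetical encoding (not from the paper):
#   leaf            -> (copy_name, species)
#   internal vertex -> (tag, left, right), with tag "dup" or "spec"

def leaves(t):
    """Return all leaf tuples below t."""
    if len(t) == 2:
        return [t]
    return leaves(t[1]) + leaves(t[2])

def lca_tag(t, names):
    """Tag of the lowest vertex whose subtree contains every copy in names."""
    for child in t[1:]:
        if len(child) == 3 and names <= {n for n, _ in leaves(child)}:
            return lca_tag(child, names)
    return t[0]

def is_speciation_quartet(t, quartet):
    """Assumed rule: four distinct species and every 3-leaf LCA is a speciation."""
    if len({s for _, s in quartet}) < 4:
        return False
    names = sorted(n for n, _ in quartet)
    return all(lca_tag(t, set(trio)) == "spec" for trio in combinations(names, 3))

def count_quartets(t):
    """Return (kept speciation quartets, excluded quartets) over all 4-leaf subsets."""
    kept = excluded = 0
    for quartet in combinations(leaves(t), 4):
        if is_speciation_quartet(t, quartet):
            kept += 1
        else:
            excluded += 1
    return kept, excluded

# A duplication at the root, followed by speciations in each copy.
tree = ("dup",
        ("spec", ("spec", ("a1", "A"), ("b1", "B")),
                 ("spec", ("c1", "C"), ("d1", "D"))),
        ("spec", ("a2", "A"), ("b2", "B")))
print(count_quartets(tree))
```

In this toy tree only the quartet drawn entirely from the first copy (a1, b1, c1, d1) is kept; every quartet that reaches across the root duplication, or that repeats a species, is excluded.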

2. Methodology

A. Theoretical Framework: Defining Correct Tagging

The authors propose a new, broadly applicable definition for Correct Duplication Tagging under the DLCoal model:

Definition: An internal vertex $u$ in a gene tree is correctly tagged as a duplication if it is the Most Recent Common Ancestor (MRCA) of at least one pair of gene copies $(x, y)$ that are paralogs (i.e., their MRCA in the underlying locus tree is a duplication event).
Rationale: This definition is backward-compatible with locus trees and aligns with A-pro's heuristic (which tags vertices as duplications if they are the MRCA of two copies from the same species).
Statistical Analysis: The authors attempt to prove Conjecture 1: A-pro is statistically consistent under the DLCoal model assuming correct tagging.
- They analyze the expected number of Speciation Quartets (SQs) for the true species tree topology versus alternative topologies.
- They utilize locus lineage scenarios (relationships of gene copies to lineages in the species tree) to classify quartets.
- Key Finding: They prove that if gene copies descend from different locus lineages at the root of the quartet, the quartet is a Duplication Quartet (DQ). However, they identify a hurdle: lineage swapping due to deep coalescence can cause a gene tree vertex to gain or lose its duplication status, which breaks the standard "exchangeability" arguments used in MSC proofs. Consequently, the consistency of A-pro remains an open question, though partial proofs are provided.
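
The A-pro heuristic mentioned in the Rationale above can be sketched in a few lines of Python. A vertex is the MRCA of two copies from the same species exactly when the species sets below two of its children overlap, so one bottom-up pass is enough. This is a minimal sketch under stated assumptions: the gene tree is already rooted (how A-pro chooses the root is not covered here), and the `Node` class and `tag_tree` function are hypothetical names, not A-pro's code.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    species: Optional[str] = None            # set for leaves only
    children: list["Node"] = field(default_factory=list)
    tag: Optional[str] = None                # filled in for internal vertices

def tag_tree(node: Node) -> set[str]:
    """Tag internal vertices bottom-up and return the species set below node."""
    if not node.children:
        return {node.species}
    seen: set[str] = set()
    overlap = False
    for child in node.children:
        below = tag_tree(child)
        if seen & below:                     # same species under two children
            overlap = True
        seen |= below
    node.tag = "duplication" if overlap else "speciation"
    return seen

# Species A appears on both sides of the root, so the root is tagged a duplication.
left = Node(children=[Node("A"), Node("B")])
right = Node(children=[Node("A"), Node("C")])
root = Node(children=[left, right])
tag_tree(root)
print(root.tag, left.tag, right.tag)
```

Note that this heuristic only sees species labels. A duplication whose copies were later lost, so that no species appears on both sides, would be tagged as a speciation, which is one way tagging errors can arise.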

B. Algorithmic Implementation: TREE-QMC-pro

To empirically test the "exclusion-only" version of A-pro's objective function (excluding DQs without agglomerating homeomorphic SQs), the authors implemented a new method called TQMC-pro.

Base: Built upon TREE-QMC, a heuristic for the Maximum Quartet Support Species Tree problem using a divide-and-conquer approach.
Modification: The recurrence relations for "auxiliary values" and edge weights were modified to exclude contributions from Duplication Quartets (DQs).
Limitation: TQMC-pro currently does not support the "agglomeration" of homeomorphic speciation quartets (a feature of A-pro) because homeomorphic SQs can have different weights, complicating the graph normalization step.

C. Empirical Evaluation

The authors conducted extensive simulations and a real-data re-analysis:

Simulation Study:
- Parameters: 25 taxa, 2,000 gene trees, varying duplication rates, ILS levels (effective population size), and gene tree estimation error (GTEE).
- Comparison: TQMC-pro vs. A-pro vs. A-multi vs. standard TQMC.
- Metrics: Tagging precision/recall and Normalized Robinson-Foulds (RF) distance for species tree accuracy.
Real Data Re-analysis:
- Dataset: The 1KP plant dataset (83 taxa, ~9,200 gene families).
- Comparison: Trees generated by A-pro, TQMC-pro, and A-multi were compared against a single-copy ASTRAL benchmark.
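
The two metrics named above are simple to state in code. The following Python sketch computes a normalized Robinson-Foulds distance from the non-trivial bipartitions of two trees, and precision/recall for a set of vertices tagged as duplications. The nested-tuple tree encoding, the function names, and the normalization by the total number of bipartitions in both trees are assumptions made for illustration; the summary does not say which tool or normalization the authors used.

```python
def _leaves(t):
    """Taxon names below t, where t is a name or a tuple of subtrees."""
    return [t] if isinstance(t, str) else [x for c in t for x in _leaves(c)]

def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side without the anchor taxon."""
    all_taxa = frozenset(_leaves(tree))
    anchor = min(all_taxa)
    splits = set()

    def visit(t):
        if isinstance(t, str):
            return frozenset([t])
        below = frozenset().union(*(visit(c) for c in t))
        side = below if anchor not in below else all_taxa - below
        if 2 <= len(side) <= len(all_taxa) - 2:   # skip trivial splits
            splits.add(side)
        return below

    visit(tree)
    return splits

def normalized_rf(tree1, tree2):
    """Symmetric difference of bipartitions divided by their total count."""
    b1, b2 = bipartitions(tree1), bipartitions(tree2)
    total = len(b1) + len(b2)
    return len(b1 ^ b2) / total if total else 0.0

def precision_recall(predicted, true):
    """Precision and recall of predicted duplication vertices against true ones."""
    tp = len(predicted & true)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(true) if true else 1.0
    return precision, recall

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(normalized_rf(t1, t2))
print(precision_recall({"u1", "u2", "u3"}, {"u2", "u3", "u4", "u5"}))
```

For two fully resolved trees on n taxa, each has n - 3 non-trivial bipartitions, so the denominator is 2(n - 3) and the distance runs from 0 (identical topologies) to 1 (no shared internal branches).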

3. Key Contributions

Formal Definition of Tagging Correctness: Introduced a rigorous definition of correct duplication tagging that accounts for deep coalescence, bridging the gap between GDL-only and unified DLCoal models.
Theoretical Insights: Provided partial proofs for A-pro's consistency and identified specific "adversarial scenarios" (involving lineage swapping and deep coalescence) that challenge standard exchangeability arguments, highlighting why a full proof of consistency is difficult.
New Algorithm (TQMC-pro): Developed the first method to implement the exclusion of duplication quartets within the TREE-QMC framework, allowing for weighted quartets based on branch lengths while excluding DQs.
Empirical Validation: Demonstrated that excluding duplication quartets substantially improves species tree accuracy in the presence of high ILS and duplication rates compared to methods that do not exclude them (A-multi and standard TQMC).

4. Results

Simulation Results

Tagging Accuracy: A-pro's tagging algorithm achieved high precision (>0.75) and recall (>0.8) across most conditions. Accuracy decreased slightly with higher ILS and Gene Tree Estimation Error (GTEE), but remained robust.
Species Tree Accuracy:
- TQMC-pro and A-pro (using true or A-pro tagging) performed nearly identically and outperformed A-multi and standard TQMC.
- The performance gap widened with higher duplication levels and higher ILS.
- Example: At high ILS (0.7) and low gene count (250), A-pro/TQMC-pro error decreased as duplication increased, whereas A-multi error increased.
Impact of GTEE: As gene tree estimation error increased, species tree error generally increased for all methods, but "pro" methods remained more robust than "multi" methods.

Plant Re-analysis (1KP Dataset)

Topology: A-pro and TQMC-pro produced trees highly congruent with the single-copy ASTRAL benchmark (differing by only 4–5 branches out of 77) and successfully recovered major clades (e.g., Monocots, Eudicots).
Failure of A-multi: A-multi differed from the benchmark by 58 branches and failed to recover major clades, often resulting in branch support of 0.
Optimization Check: The authors confirmed that the A-multi tree was indeed the optimal solution for the A-multi objective function. This suggests the poor performance of A-multi is due to its objective function (which includes duplication quartets) rather than a failure to find the global optimum.

5. Significance

Validation of A-pro: The study provides strong empirical evidence that A-pro's strategy of excluding duplication quartets is superior for species tree inference under complex evolutionary scenarios involving both ILS and GDL.
Theoretical Foundation: By defining "correct tagging" in the context of deep coalescence, the paper lays the groundwork for future theoretical proofs of consistency for A-pro and similar methods.
Practical Tool: TQMC-pro offers a flexible alternative to A-pro, and its near-identical accuracy suggests that the exclusion of duplication quartets is the critical factor, regardless of the specific optimization heuristic used (ASTRAL vs. TREE-QMC).
Explanation of A-multi's Weakness: The results suggest why A-multi struggles in high-duplication/high-ILS scenarios: it treats duplication-driven quartets as informative signal, whereas they act as noise for species tree topology.

In conclusion, the paper establishes that while a full theoretical proof of A-pro's consistency under DLCoal remains an open challenge due to the complexities of deep coalescence, the exclusion of duplication quartets is supported by partial theoretical results and is an empirically superior strategy for reconstructing species trees from multi-copy gene families.