How much information is there for inferring species trees?

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to solve a massive, ancient family reunion photo puzzle. You have thousands of pieces (DNA sequences from different parts of the genome) and you want to figure out exactly how everyone is related. This is what scientists call phylogenetics—drawing the family tree of species.

For a long time, scientists thought the rule was simple: "More pieces = Better picture." They assumed that if you just threw every single piece of DNA you had into the computer, the answer would be perfect.

But this new paper by Analisa Milkey and her team says: "Wait a minute. Sometimes, adding more pieces actually makes the picture blurrier."

Here is the breakdown of their discovery, using some everyday analogies.

1. The Problem: Too Much Noise

Imagine you are trying to hear a friend whisper a secret in a crowded, noisy room.

The Good Data: These are clear, loud whispers. They tell you exactly what your friend said.
The Bad Data: These are people shouting random nonsense, or people whispering so softly you can't hear them at all.

In the past, scientists would grab all the voices in the room (all the DNA) and try to figure out the secret. But if you include the people shouting nonsense (saturation/mutation noise) or the people whispering too quietly (no variation), the computer gets confused. It spends all its energy trying to make sense of the noise, and the final family tree ends up looking a bit wobbly.

2. The New Tool: The "Information Meter"

The authors invented a new way to measure how "useful" a piece of DNA is. They call it Phylogenetic Information Content.

Think of it like a flashlight in a dark room:

Low Information: A dim, flickering candle. It doesn't help you see much.
High Information: A bright, steady spotlight. It clearly illuminates the furniture (the family tree).

Their method compares two things:

The Guess (Prior): What the family tree looks like before we look at the DNA (just a guess).
The Reality (Posterior): What the tree looks like after we analyze the DNA.

If the DNA is great, the set of plausible "Reality" trees shrinks down to a tight cluster, while the set of "Guess" trees stays huge and vague. How much the set shrinks is the Information. If the DNA is bad, the "Reality" trees are spread out almost exactly like the "Guess" trees, meaning the data taught us nothing new.

3. The Experiments: What They Found

Experiment A: Length Matters (But only up to a point)
They tested if longer DNA strands were better.

Analogy: Reading a short sentence vs. a long book.
Result: Going from a short sentence to a long book helped a lot. But once you have a really good book, reading a second book of the exact same story doesn't help you understand the plot any better. You just waste time reading.

Experiment B: Quantity vs. Quality
They tested if having more DNA strands (loci) was always better.

Analogy: Asking 100 people for directions vs. asking 10 experts.
Result: If you ask 100 people but 90 of them are giving you wrong directions (low-quality data), you will get lost. If you ask only the 10 experts who know the way, you get there faster and more accurately.
Key Finding: When the data is "uninformative" (noisy or too quiet), throwing more of it at the problem actually makes the final tree less accurate.

Experiment C: The Speed of Evolution
They looked at DNA that changes very slowly vs. very quickly.

Slow DNA: Like a photo that hasn't been updated in 100 years. It's hard to tell who is related to whom because nothing has changed. (Low info).
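Fast DNA: Like a photo that changes every second. So many changes pile on top of each other that the details can blur (saturation). In principle this lowers the information, although the paper's simulations did not show a drop even at the fastest rates tested.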
Just Right DNA: The sweet spot where there is enough change to see relationships, but not so much that it's a blur. (High info).

4. The Big Takeaway: Be a Curator, Not a Hoarder

The paper suggests that for scientists trying to build family trees, quality is more important than quantity.

Instead of dumping the entire dataset into the computer, scientists should:

Measure the "brightness" of each piece of DNA using their new meter.
Throw away the dim candles (the uninformative, noisy, or silent DNA).
Keep only the spotlights.

The Analogy of the Chef:
Imagine you are making a soup.

Old Way: Throw in every vegetable you have in the fridge, even the rotting ones and the ones that are just plain water. The soup tastes muddy.
New Way: Taste each vegetable first. Keep the fresh, flavorful carrots and potatoes. Throw away the rotting ones and the water. The resulting soup is delicious and clear.

Why Does This Matter?

Computers take a long time to crunch massive amounts of data. By filtering out the "bad" data first, scientists can:

Save massive amounts of computing time and money.
Get a more accurate family tree faster.
Avoid the trap of thinking "more data is always better," which can actually lead to wrong conclusions.

In short: Don't just collect more data. Collect the right data. Sometimes, knowing what to leave out is the key to finding the truth.

1. Problem Statement

The field of phylogenomics is currently characterized by an abundance of genomic data. However, a significant challenge arises from Incomplete Lineage Sorting (ILS), where gene trees differ topologically from the species tree due to deep coalescence. While the Multispecies Coalescent (MSC) model accounts for ILS, Bayesian implementations are computationally intensive and often struggle with large datasets.

Current practices often assume that "more data is better," leading researchers to include all available loci in analyses. However, this approach ignores the informativeness of individual loci. Some loci may be uninformative due to:

Low variation: Too few substitutions to resolve relationships.
Saturation: High substitution rates causing multiple hits per site, obscuring phylogenetic signal.
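To make the two failure modes concrete, here is a minimal sketch of how observed sequence differences level off as the true amount of change grows. It assumes the Jukes-Cantor (JC69) substitution model, chosen here only for illustration; the summary does not say which model the authors used. The function name and the listed distances are hypothetical.

```python
import math


def jc69_expected_p_distance(d: float) -> float:
    """Expected proportion of differing sites between two sequences
    separated by d substitutions per site, under the JC69 model
    (an illustrative assumption, not necessarily the paper's model)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# Very small d: almost no sites differ (low variation).
# Very large d: the proportion plateaus just below 0.75 (saturation),
# so extra substitutions no longer change what we observe.
for d in (0.001, 0.1, 1.0, 3.0, 10.0):
    p = jc69_expected_p_distance(d)
    print(f"d = {d:>6}: expected proportion of differing sites = {p:.4f}")
```

At both extremes the alignment carries little signal: near zero there is nothing to compare, and near the plateau two very different histories produce nearly the same observed difference.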

Existing measures of phylogenetic information content have historically focused on gene trees rather than species trees and often rely on discrete topologies, ignoring continuous parameters like branch lengths. Furthermore, previous studies using substitution rates as a proxy for information content (e.g., Lanier et al., 2014) may be misleading. The authors aim to determine whether subsampling datasets to include only the most informative loci can improve species tree inference accuracy and computational efficiency without sacrificing resolution.

2. Methodology

A. New Metric: Phylogenetic Information Content ( $I$ )

The authors introduce a novel measure of information content ( $I$ ) that quantifies the reduction in "tree space" occupied by a posterior sample relative to a prior sample.

Concept: It calculates the difference between the spread of trees in the prior distribution (before seeing data) and the posterior distribution (after seeing data).
Calculation:
1. Generate a sample of trees from the prior and a sample from the posterior.
2. Calculate the mean tree for both samples (using the method by Miller, Owen, and Provan, 2015).
3. Compute the geodesic distance (Owen and Provan, 2010) between the mean tree and every sampled tree within the BHV space (Billera-Holmes-Vogtmann space).
4. Determine the radius ( $R$ ) of the hypersphere, centered on the mean tree, that contains the 95% of sampled trees closest to the mean.
5. Formula:
  $I = \left( \frac{R_{prior} - R_{post}}{R_{prior}} \right) \times 100$
Interpretation: A value near 100% indicates high information (a large reduction in occupied tree space); 0% indicates no information.
Normalization: To focus on topological information and prevent branch length dominance, the prior mean tree is scaled to match the total length of the posterior mean tree.
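The last two steps of the calculation can be sketched in a few lines. This is not the authors' implementation: it assumes the geodesic distances from each sampled tree to its sample's mean tree have already been computed elsewhere (for example with the op software) and that any length normalization has already been applied. The function names `radius_95` and `information_content` are hypothetical, as is the choice of the ceiling rule for picking the 95% boundary.

```python
import math
from typing import Sequence


def radius_95(distances: Sequence[float], coverage: float = 0.95) -> float:
    """Radius around the mean tree that contains the `coverage` fraction
    of sampled trees closest to the mean (assumed rule: ceiling of
    coverage * sample size)."""
    if not distances:
        raise ValueError("need at least one distance")
    ordered = sorted(distances)
    # round() guards against floating-point noise before taking the ceiling
    k = max(1, math.ceil(round(coverage * len(ordered), 9)))
    return ordered[k - 1]


def information_content(prior_distances: Sequence[float],
                        posterior_distances: Sequence[float]) -> float:
    """I = (R_prior - R_post) / R_prior * 100, as in the formula above."""
    r_prior = radius_95(prior_distances)
    r_post = radius_95(posterior_distances)
    if r_prior <= 0:
        raise ValueError("prior radius must be positive")
    return (r_prior - r_post) / r_prior * 100.0


# Hypothetical distances: the posterior sample is ten times tighter.
prior = [float(i) for i in range(1, 101)]
posterior = [i / 10 for i in range(1, 101)]
print(f"I = {information_content(prior, posterior):.1f}%")
```

If the posterior sample is as spread out as the prior sample, the two radii match and I comes out as 0.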

B. Software and Simulation Framework

Inference Engine: SMC (Sequential Monte Carlo), which samples from the MSC model.
Tree Distance/Geometry: op software (open-source) for calculating mean trees and BHV distances.
Experimental Design:
- Experiment 1 (Sequence Length): Simulated 10 loci, 5 species, varying sites per locus (10, 100, 1000, and infinite/true gene trees).
- Experiment 2 (Number of Loci): Simulated varying numbers of loci (10 to 100) using true gene trees (infinite sites) to isolate the effect of locus count.
- Experiment 3 (Rate Variation): Simulated 100 loci with varying relative evolutionary rates (0.0001 to 3.3). Analyzed species trees using loci filtered by information content cutoffs (0% to 90%).
- Empirical Application: Applied to a teleost fish dataset (22 species, 16 loci) to test the filtering strategy on real data.

3. Key Contributions

Development of a Species-Tree Specific Information Metric: Unlike previous metrics focused on gene trees or discrete topologies, this method accounts for both topology and branch lengths in the context of species tree inference.
Quantification of the "More Data" Myth: The study provides simulation-based and empirical evidence that adding more data does not always improve inference; specifically, adding uninformative loci can degrade accuracy.
Subsampling Strategy: Demonstrates that filtering loci based on calculated information content ( $I$ ) rather than arbitrary biological proxies (like substitution rate) yields superior species tree estimates.

4. Results

Experiment 1: Sequence Length

Finding: Information content and accuracy increased significantly with sequence length, particularly from 10 to 100 sites.
Data: Average information content rose from 60.6% (10 sites) to 93.3% (infinite sites).
Conclusion: When data are informative, more data improves inference. However, even with infinite sites, 10 loci were insufficient for perfect accuracy in high-ILS scenarios.

Experiment 2: Number of Loci

Finding: Information and accuracy increased with the number of loci but showed diminishing returns.
Data: Moving from 30 to 100 loci resulted in only marginal improvements (Information: 99.1% $\to$ 99.8%; Inaccuracy: 0.0028 $\to$ 0.0020).
Conclusion: Once a sufficient number of informative loci are included, adding more loci offers negligible practical benefit relative to the computational cost.
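The size of these diminishing returns can be worked out directly from the figures quoted above. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
# Figures quoted in the Experiment 2 results above
loci_before, loci_after = 30, 100
info_before, info_after = 99.1, 99.8          # information content, percent
inacc_before, inacc_after = 0.0028, 0.0020    # reported inaccuracy

added_loci = loci_after - loci_before
info_gain_points = info_after - info_before
info_gain_per_locus = info_gain_points / added_loci
relative_inaccuracy_drop = (inacc_before - inacc_after) / inacc_before

print(f"Loci added: {added_loci}")
print(f"Information gain: {info_gain_points:.1f} percentage points "
      f"({info_gain_per_locus:.3f} per added locus)")
print(f"Relative drop in inaccuracy: {relative_inaccuracy_drop:.1%}")
```

More than tripling the number of loci buys about 0.01 percentage points of information per added locus, which is the sense in which the gain is marginal relative to the extra computation.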

Experiment 3: Rate Variation and Filtering

Finding: Locus informativeness is non-linear with respect to evolutionary rate.
- Low rates: Low information due to lack of variable sites (e.g., 0.7% at rate 0.0001).
- High rates: High information. Saturation is expected to reduce information eventually, but no such decline was observed at the highest rates tested.
Filtering Impact:
- Best Accuracy: Achieved at a 70% information cutoff (excluding the least informative loci), resulting in the lowest BHV distance (0.058).
- Worst Accuracy: Occurred at a 10% cutoff (including almost all loci, including uninformative ones), with a BHV distance of 0.080.
- Over-filtering: Excluding too many loci (e.g., >50% cutoff, leaving only 1 locus) drastically reduced information content.

Empirical Dataset (Teleost Fishes)

Finding: Average locus information was 31.69%.
Result: Species tree information content generally increased as the cutoff for inclusion was raised (removing low-information loci), peaking around the 30% cutoff.
Caveat: Aggressive filtering (40%+ cutoff) reduced the number of loci too much (down to 6 or 1), causing a drop in species tree information, highlighting the need for a balance between quality and quantity.

5. Significance and Recommendations

Paradigm Shift: The paper challenges the dogma that "more data is always better." It argues that data quality (informativeness) is more critical than quantity for species tree inference under the MSC.
Practical Workflow:
1. For each locus, sample gene trees from the prior and from the posterior.
2. Calculate the information content ( $I$ ) for each locus.
3. Subsample: Remove loci with very low information content (e.g., those contributing little to the reduction of tree space).
4. Avoid Over-filtering: Ensure enough loci remain to maintain statistical power (avoiding the "single gene" problem).
Methodological Caveat: The authors note that their method relies on estimating gene trees first. For extremely short or low-information loci, methods that avoid estimating gene trees (e.g., SNAPP, which integrates over them, or SVDquartets) might be more appropriate, as they can use all the data without the error introduced by intermediate gene tree estimation.
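The subsampling and over-filtering steps of the workflow above could look roughly like this. It is a sketch under stated assumptions, not the authors' procedure: the per-locus values of $I$ are taken as already computed, the cutoff is treated as inclusive, and the `min_loci` fallback is a hypothetical safeguard added here to illustrate the over-filtering warning. The locus names and values are made up.

```python
from typing import Dict, List


def select_loci(info_by_locus: Dict[str, float],
                cutoff: float,
                min_loci: int = 2) -> List[str]:
    """Keep loci whose information content is at least `cutoff` percent.

    If fewer than `min_loci` pass, fall back to the `min_loci` most
    informative loci (hypothetical guard against over-filtering).
    Returned names are ordered from most to least informative.
    """
    ranked = sorted(info_by_locus, key=lambda name: info_by_locus[name], reverse=True)
    kept = [name for name in ranked if info_by_locus[name] >= cutoff]
    if len(kept) < min_loci:
        kept = ranked[:min_loci]
    return kept


# Made-up per-locus information values, in percent
example_info = {"locusA": 5.0, "locusB": 42.0, "locusC": 71.5, "locusD": 0.7}

print(select_loci(example_info, cutoff=30.0))   # moderate filtering
print(select_loci(example_info, cutoff=90.0))   # too strict: fallback applies
```

In practice the cutoff would be chosen by comparing species tree results across several cutoffs, as the experiments in the summary do, rather than fixed in advance.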

In summary, this paper provides a rigorous mathematical framework for assessing phylogenetic information and offers a data-driven strategy for subsampling genomic datasets to optimize the trade-off between computational efficiency and species tree accuracy.