Estimating Bayesian phylogenetic information content using geodesic distances

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to solve a massive jigsaw puzzle, but you've never seen the picture on the box before. You have a bag of pieces (your DNA data) and a vague idea of what the picture might look like based on general knowledge (your "prior" guess).

This paper introduces a new, clever way to measure how much the puzzle pieces actually tell you about the final picture, and whether different groups of pieces are arguing with each other.

Here is the breakdown of their new method using simple analogies:

1. The Old Way vs. The New Way

The Old Way (Counting Topologies):
Previously, scientists tried to measure information by counting how many different tree shapes (topologies) were possible. Imagine trying to guess a password by counting how many combinations exist. If you have a 12-digit password, there are a trillion possibilities. If your data narrows it down to just a few, you have a lot of information.

The Problem: As the number of species (taxa) grows, the number of possible trees explodes into numbers so huge (like the number of atoms in the universe) that you can't count them all. It's like trying to count every grain of sand on a beach to see how much sand you have.

The New Way (The "Cloud" of Trees):
Instead of counting individual trees, the authors look at the shape and spread of the trees.

The Analogy: Imagine the "Prior" (your guess before seeing data) is a giant, fluffy cloud of smoke floating in a room. It's spread out everywhere because you have no idea where the picture is.
Now, you look at the "Posterior" (the result after looking at the DNA data). This is a second cloud.
The Measurement: If the data is useless, the second cloud looks just like the first one—big and fluffy. If the data is amazing, the second cloud shrinks down into a tiny, dense ball.
The Metric: They find the center of each cloud and measure how far the cloud spreads around it, then compare the spread of the second cloud to the spread of the first. The more the cloud shrinks, the more information you have.

2. The "Geodesic" Shortcut

To measure the distance between these tree shapes, they use something called Geodesic Distance.

The Analogy: Imagine the room where the clouds are floating is a strange, curved landscape (like the surface of the Earth, not a flat floor). To get from point A to point B, you can't just walk in a straight line through the walls; you have to walk along the surface.
In the world of evolutionary trees, the "surface" is a complex mathematical space called "Treespace." The authors use a special map (an algorithm) to find the shortest, most natural path between two trees on this curved surface. This allows them to measure how different the trees really are, even if they look very complicated.

3. Measuring "Dissonance" (The Argument)

Sometimes, different parts of your DNA tell different stories. Maybe the first half of a gene says "We are related to birds," while the second half says "No, we are related to lizards." This is called Dissonance.

The Analogy: Imagine you are asking two friends to describe a suspect.
- Friend A says: "He's tall, wearing a red hat."
- Friend B says: "He's short, wearing a blue hat."
If you ask them separately, they are both confident (low variance). But if you put them together, they are arguing (high dissonance).
The authors' method calculates a "Dissonance Score." If the score is low, the data agrees. If the score is high, the data is fighting, suggesting something weird happened (like a gene jumping from one species to another, which actually happened in the real-world example they studied with a plant called Bloodroot).

4. Why This Matters

Scalability: This method works even if you have thousands of species. You don't need to count every possible tree; you just need to see how much the "cloud" of likely trees shrinks.
Truth vs. Noise: It helps scientists decide if a specific piece of DNA is actually useful or if it's just random noise (like a book filled with gibberish).
Filtering: In modern "phylogenomics" (studying evolution using huge amounts of DNA), scientists often have thousands of genes. This tool helps them filter out the "boring" genes that don't tell us anything and focus only on the "loud" genes that have a clear story to tell.

The Bottom Line

This paper gives scientists a new ruler. Instead of trying to count the impossible number of tree shapes, they measure how much the data concentrates the possibilities.

Big, fuzzy cloud after data? = No information.
Tiny, sharp ball after data? = Lots of information.
Two tight clouds sitting far apart? = The datasets disagree (Dissonance).

It's a way to turn the chaotic noise of DNA sequencing into a clear, measurable signal of evolutionary history.

1. Problem Statement

The paper addresses the challenge of quantifying the phylogenetic information content in biological sequence data within a Bayesian framework.

Limitations of Previous Methods: Existing methods, such as those based on relative entropy (Lewis et al., 2016), rely on comparing the probability distributions of discrete tree topologies. These methods suffer from poor scalability. As the number of taxa increases, the space of possible tree topologies becomes vast (e.g., >650 million trees for just 12 taxa), making it impossible to sample the posterior distribution adequately to estimate entropy accurately.
Need for Topological and Length Sensitivity: Researchers need a metric that captures not just topological resolution but also information regarding branch lengths, and a way to measure dissonance (conflict) between different data subsets (e.g., different genes or loci) without relying on arbitrary topology-only distance metrics.
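
The growth behind that figure is easy to reproduce. A minimal sketch, assuming unrooted, fully resolved (binary) trees, for which the number of topologies on $n$ labeled taxa is $(2n-5)!!$; the function name is illustrative and does not come from the paper:

```python
def count_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted binary tree topologies on n_taxa labeled tips: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n - 5 (empty product when n = 3)
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 8, 12, 20):
    print(n, count_unrooted_topologies(n))
```

For 12 taxa this gives 654,729,075, which is the ">650 million" quoted above.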

2. Methodology

The authors propose a new approach based on geodesic distances in the space of phylogenetic trees (treespace), utilizing the framework developed by Billera, Holmes, and Vogtmann (2001) and the algorithms of Owen and Provan (2010).
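
For intuition about what a geodesic distance in treespace is, the smallest nontrivial case can be written out by hand. The sketch below assumes unrooted four-taxon trees, each with one internal edge and four pendant edges, and treats pendant edges as ordinary Euclidean coordinates. The `FourTaxonTree` type and the function name are hypothetical. Two trees with the same topology lie in the same flat region, so their distance is Euclidean; two trees with different topologies can only be connected through the star tree, where the internal edge has shrunk to zero. Larger trees need the Owen-Provan algorithm, which this sketch does not implement.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FourTaxonTree:
    topology: str    # one of "AB|CD", "AC|BD", "AD|BC"
    internal: float  # length of the single internal edge
    pendant: tuple   # pendant edge lengths for taxa A, B, C, D


def geodesic_distance_4taxa(t1: FourTaxonTree, t2: FourTaxonTree) -> float:
    # Pendant edges exist in every topology, so they contribute
    # an ordinary squared Euclidean term.
    pendant_sq = sum((a - b) ** 2 for a, b in zip(t1.pendant, t2.pendant))
    if t1.topology == t2.topology:
        # Same region: straight line between the internal edge lengths.
        internal_term = abs(t1.internal - t2.internal)
    else:
        # Different regions: shrink one internal edge to zero (star tree),
        # then grow the other, so the two internal lengths add.
        internal_term = t1.internal + t2.internal
    return math.sqrt(internal_term ** 2 + pendant_sq)


same = geodesic_distance_4taxa(
    FourTaxonTree("AB|CD", 0.3, (1, 1, 1, 1)),
    FourTaxonTree("AB|CD", 0.1, (1, 1, 1, 1)),
)
different = geodesic_distance_4taxa(
    FourTaxonTree("AB|CD", 0.3, (1, 1, 1, 1)),
    FourTaxonTree("AC|BD", 0.1, (1, 1, 1, 1)),
)
print(same, different)
```

With identical pendant edges, the conflicting pair is farther apart than the agreeing pair even though the internal edge lengths are the same, which is the behavior a conflict-sensitive distance needs.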

A. Information Content Measure (LCR)

The core metric is the Log Concentration Ratio (LCR), which compares the variance of the posterior distribution to the variance of the prior distribution.

Concept: Information is defined as the reduction in uncertainty (variance) from the prior to the posterior.
Calculation:
1. Generate a sample of trees from the prior distribution ( $N_0$ ) and the posterior distribution ( $N$ ) using MCMC.
2. Compute the Fréchet mean tree for both samples.
3. Calculate the geodesic distance (Owen-Provan distance) between the mean tree and every sampled tree.
4. Define the "volume" ( $V$ ) of the distribution. The authors primarily use the 95% radius (RAD): the radius of the smallest hypersphere centered at the mean tree that contains 95% of the sampled trees.
5. Formula:
  $LCR = \log\left(\frac{V_0}{V}\right)$
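  Where $V_0$ is the spread of the prior sample and $V$ is the spread of the posterior sample (here, their 95% radii), and $\log$ is the natural logarithm, as the percent-information transform implies.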
Interpretation:
- $LCR = 0$: No information (Posterior variance = Prior variance).
- $LCR \to \infty$ : Complete information (Posterior variance $\to 0$ ).
- Percent Information ( $I$ ): To make the metric more intuitive, the authors transform LCR to a 0–100 scale: $I = 100(1 - e^{-LCR})$ .
Scaling: To isolate topological information from edge-length information, trees in both samples are scaled so that the mean tree length is identical (e.g., 1.0). This ensures the metric reflects the concentration of the distribution rather than just the absolute scale of branch lengths.
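
Once the geodesic distances from each sampled tree to its sample's Fréchet mean are available, the rest of the calculation is short. The sketch below assumes those distances have already been computed (after rescaling) by an external tool. The function names and the toy distance values are made up for illustration, and natural logarithms are assumed.

```python
import math


def radius_95(distances, coverage=0.95):
    """Smallest radius around the mean tree that contains `coverage` of the sampled trees."""
    ordered = sorted(distances)
    # Index of the tree that completes the required fraction of the sample
    k = math.ceil(coverage * len(ordered)) - 1
    return ordered[k]


def log_concentration_ratio(prior_distances, posterior_distances):
    """LCR = log(V0 / V), using the 95% radius as the spread measure V."""
    return math.log(radius_95(prior_distances) / radius_95(posterior_distances))


def percent_information(lcr):
    """I = 100 * (1 - exp(-LCR))."""
    return 100.0 * (1.0 - math.exp(-lcr))


# Toy distances (not from the paper): the posterior is ten times tighter than the prior
prior = [0.1 * i for i in range(1, 101)]
posterior = [0.01 * i for i in range(1, 101)]
lcr = log_concentration_ratio(prior, posterior)
print(round(lcr, 3), round(percent_information(lcr), 1))
```

Because $e^{-LCR} = V/V_0$, the percent-information value is simply the percentage by which the radius shrank: a posterior radius one tenth the size of the prior radius corresponds to 90% information.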

B. Dissonance Measure

To measure conflict between data subsets (e.g., two different genes), the authors define a Dissonance ( $D$ ) metric based on Cohen's $d$ effect size.

Formula:
$D = \frac{d_{12}}{\sqrt{\frac{(n_1-1)r_1^2 + (n_2-1)r_2^2}{n_1+n_2-2}}}$
Where:
- $d_{12}$ is the geodesic distance between the scaled mean trees of the two datasets.
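- $r_1$ and $r_2$ are the 95% radii of the two posterior distributions, which play the role of the standard deviations in Cohen's $d$.
- $n_1$ and $n_2$ are the numbers of trees sampled from the two posterior distributions.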
Interpretation: A value of 0 indicates identical distributions. Higher values indicate greater conflict between the datasets.
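
The formula translates directly into code. A minimal sketch, assuming $n_1$ and $n_2$ are the numbers of sampled trees and that the mean-tree distance and the radii have been computed elsewhere; the function name and the input values are hypothetical:

```python
import math


def dissonance(d12, r1, r2, n1, n2):
    """Cohen's-d-style dissonance: mean-tree distance divided by the pooled radius."""
    pooled = math.sqrt(((n1 - 1) * r1 ** 2 + (n2 - 1) * r2 ** 2) / (n1 + n2 - 2))
    return d12 / pooled


# Hypothetical inputs: mean trees far apart relative to the spread of each sample
print(dissonance(d12=4.0, r1=0.5, r2=0.5, n1=1000, n2=1000))
# Hypothetical inputs: mean trees close together relative to the spread
print(dissonance(d12=0.05, r1=0.5, r2=0.5, n1=1000, n2=1000))
```

When the two radii are equal, the pooled radius equals that shared value, so $D$ is just the mean-tree distance expressed in units of the radius.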

3. Key Contributions

Scalability: Unlike entropy-based methods that struggle with large taxon sets, this method scales linearly with the number of sampled trees, making it applicable to phylogenomic datasets with hundreds or thousands of taxa.
Geometric Foundation: It utilizes the rigorous geometry of treespace (BHV space) to define variance and distance, providing a continuous measure of information rather than a discrete topological count.
Integrated Metric: It simultaneously captures information about both topology and branch lengths.
Dissonance Quantification: It provides a standardized, geometrically grounded metric to quantify conflict between data partitions, superior to simple topology-only distances like Robinson-Foulds.

4. Results

The authors validated the method through simulation experiments and empirical analyses.

Simulation Experiments

Substitution Rates: Information content peaked at an ideal substitution rate and decreased when rates were too low (no signal) or too high (saturation).
Sequence Length: Information increased with sequence length, though very short sequences (1 site) showed negative information, meaning the posterior sample was more spread out than the prior sample (noise).
Missing Data: Information content decreased linearly as the percentage of missing data increased.
Rate Heterogeneity: High among-site rate variation (ASRV) reduced information content.
Dissonance: In random walk simulations, dissonance correlated strongly ( $r=0.96$ ) with the geodesic distance between the generating model trees, confirming the metric's sensitivity to topological conflict.

Empirical Analyses

Saturation Test (Green Algae psaB):
- Contrary to the assumption that 3rd codon positions are saturated, the method found they contained more information (LCR = 2.73, $I$ = 93.5%) than 2nd positions (LCR = 1.75, $I$ = 82.6%).
- The Fréchet mean tree from 3rd positions showed higher resolution.
- The method also indicated that 3rd positions were not misleading (that is, not a source of misinformation): their mean tree was closer to the "all-sites" mean tree than the 2nd-position mean tree was.
Horizontal Gene Transfer (Bloodroot rps11):
- The study analyzed a locus whose 5' end was vertically inherited and whose 3' end was acquired by horizontal transfer.
- Result: The dissonance between the 5' and 3' subsets was extremely high ( $D > 8$ ), while dissonance between independent samples of the same subset was negligible ( $D < 0.2$ ).
- The mean trees clearly showed the 5' end grouping with eudicots and the 3' end grouping with monocots, validating the method's ability to detect strong phylogenetic conflict.

5. Significance and Applications

Phylogenomics: The method offers a practical tool for locus filtering. In large-scale phylogenomic analyses, researchers can calculate the information content of individual loci and exclude those with low information (e.g., due to saturation or lack of signal) to improve computational efficiency and accuracy in species tree inference (e.g., in BEAST2 or ASTRAL).
Model-Based Robustness: Unlike saturation tests (e.g., PhyloMAd) that rely on simulation-derived critical values, this method uses the exact Bayesian model being employed for inference, making it robust to complex model specifications (e.g., CAT models).
Conflict Detection: It provides a rigorous, quantitative way to identify and measure gene tree discordance, aiding in the detection of biological phenomena like horizontal gene transfer or incomplete lineage sorting.
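
The locus-filtering use case can be sketched as a simple threshold on percent information. This is only an illustration: the cutoff of 50% is an arbitrary assumption rather than a value recommended in the paper, and the locus names and LCR values are invented.

```python
import math


def filter_loci(lcr_by_locus, min_percent_information=50.0):
    """Keep loci whose percent information meets a user-chosen threshold."""
    kept = {}
    for locus, lcr in lcr_by_locus.items():
        info = 100.0 * (1.0 - math.exp(-lcr))
        if info >= min_percent_information:
            kept[locus] = info
    return kept


# Hypothetical loci and LCR values, for illustration only
example = {"locus_a": 2.5, "locus_b": 0.1, "locus_c": -0.2}
print(filter_loci(example))
```

A locus with negative LCR, like `locus_c` here, always falls below any non-negative cutoff, since its percent information is negative.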

In summary, Milkey and Lewis present a scalable, geometrically rigorous framework for measuring how much data "tells us" about a phylogeny, moving beyond discrete topology counts to a continuous measure of variance reduction in treespace.