Estimating Bayesian phylogenetic information content using geodesic distances

This paper introduces a scalable Bayesian measure of phylogenetic information content based on geodesic distances in treespace, which quantifies the reduction in tree variance from prior to posterior distributions to assess data informativeness and detect conflicts among datasets.

Milkey, A., Lewis, P. O.

Published 2026-04-01

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to solve a massive jigsaw puzzle, but you've never seen the picture on the box before. You have a bag of pieces (your DNA data) and a vague idea of what the picture might look like based on general knowledge (your "prior" guess).

This paper introduces a new, clever way to measure how much the puzzle pieces actually tell you about the final picture, and whether different groups of pieces are arguing with each other.

Here is the breakdown of their new method using simple analogies:

1. The Old Way vs. The New Way

The Old Way (Counting Topologies):
Previously, scientists tried to measure information by counting how many different tree shapes (topologies) were possible. Imagine trying to guess a password by counting how many combinations exist. If you have a 12-digit password, there are a trillion possibilities. If your data narrows it down to just a few, you have a lot of information.

  • The Problem: As the number of species (taxa) grows, the number of possible trees grows faster than exponentially, quickly reaching numbers so huge (comparable to the number of atoms in the universe) that you can't enumerate them all. It's like trying to count every grain of sand on a beach to see how much sand you have.
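To see how fast the count explodes, here is a minimal sketch. It assumes we are counting unrooted, fully bifurcating trees, for which the standard formula is the double factorial (2n - 5)!!. The function name is made up for illustration.

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating tree shapes for n_taxa labeled tips.

    Uses the standard formula (2n - 5)!! = 3 * 5 * ... * (2n - 5).
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(f"{n:>3} taxa: {num_unrooted_topologies(n):.3e} topologies")
```

Python integers never overflow, so the function stays exact even when the count is far too large to enumerate.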

The New Way (The "Cloud" of Trees):
Instead of counting individual trees, the authors look at the shape and spread of the trees.

  • The Analogy: Imagine the "Prior" (your guess before seeing data) is a giant, fluffy cloud of smoke floating in a room. It's spread out everywhere because you have no idea where the picture is.
  • Now, you look at the "Posterior" (the result after looking at the DNA data). This is a second cloud.
  • The Measurement: If the data is useless, the second cloud looks just like the first one—big and fluffy. If the data is amazing, the second cloud shrinks down into a tiny, dense ball.
  • The Metric: They measure how spread out each cloud is (its variance, based on distances between trees) and compare the posterior cloud with the prior cloud. The more the cloud shrinks after seeing the data, the more information you have.
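The shrinking-cloud idea can be sketched with toy numbers. This is not the authors' implementation: it assumes points in a flat plane stand in for trees, ordinary Euclidean distance stands in for distance in treespace, and information is scored as the fraction of prior variance removed. The names `spread` and `information` are hypothetical.

```python
import random


def spread(points):
    """Variance of a cloud: mean squared distance from its centroid."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points) / n


def information(prior_sample, posterior_sample):
    """Fraction of the prior variance removed by the data (assumed form)."""
    return 1.0 - spread(posterior_sample) / spread(prior_sample)


rng = random.Random(1)
# Prior: a big, fluffy cloud.
prior = [(rng.gauss(0, 5), rng.gauss(0, 5)) for _ in range(2000)]
# Useless data: the posterior looks just like the prior.
useless = [(rng.gauss(0, 5), rng.gauss(0, 5)) for _ in range(2000)]
# Informative data: the posterior collapses into a tight ball.
informative = [(rng.gauss(3, 0.5), rng.gauss(-2, 0.5)) for _ in range(2000)]

if __name__ == "__main__":
    print("useless data:    ", round(information(prior, useless), 2))
    print("informative data:", round(information(prior, informative), 2))
```

Under these assumptions a score near 0 means the cloud did not shrink, and a score near 1 means it collapsed almost completely.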

2. The "Geodesic" Shortcut

To measure the distance between these tree shapes, they use something called Geodesic Distance.

  • The Analogy: Imagine the room where the clouds are floating is a strange, curved landscape (like the surface of the Earth, not a flat floor). To get from point A to point B, you can't just tunnel in a straight line through the ground; you have to travel along the surface.
  • In the world of evolutionary trees, the "surface" is a complex mathematical space called "Treespace." The authors use a special map (an algorithm) to find the shortest, most natural path between two trees on this curved surface. This allows them to measure how different the trees really are, even if they look very complicated.
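For the smallest interesting case, four species, the shortest path can be written down by hand. The sketch below is a toy, not the general algorithm the authors rely on. It assumes the commonly used treespace in which each tree shape is its own region, leaf (pendant) branch lengths are included as ordinary coordinates, and a four-species tree has exactly one internal branch. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class QuartetTree:
    split: frozenset   # which pair of taxa the internal branch separates
    internal: float    # length of the single internal branch
    pendant: tuple     # leaf branch lengths for taxa A, B, C, D


def make_tree(pair, internal, pendant):
    """Build a 4-taxon tree; pair="AB" means the tree groups AB against CD."""
    taxa = frozenset("ABCD")
    side = frozenset(pair)
    # Store both sides so that "AB" and "CD" describe the same split.
    return QuartetTree(frozenset({side, taxa - side}), internal, tuple(pendant))


def geodesic_distance(t1, t2):
    """Shortest-path distance between two 4-taxon trees (toy case only)."""
    if t1.split == t2.split:
        # Same shape: just slide along the shared internal branch.
        internal_gap = abs(t1.internal - t2.internal)
    else:
        # Different shape: shrink one internal branch to zero,
        # then grow the other one from zero.
        internal_gap = t1.internal + t2.internal
    pendant_sq = sum((a - b) ** 2 for a, b in zip(t1.pendant, t2.pendant))
    return math.sqrt(internal_gap ** 2 + pendant_sq)


if __name__ == "__main__":
    leaves = (0.1, 0.1, 0.1, 0.1)
    same_shape = geodesic_distance(make_tree("AB", 0.3, leaves),
                                   make_tree("CD", 0.5, leaves))
    diff_shape = geodesic_distance(make_tree("AB", 0.3, leaves),
                                   make_tree("AC", 0.4, leaves))
    print(same_shape, diff_shape)
```

With more than four species the path may cut through several tree shapes at once, which is why a dedicated algorithm is needed in practice.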

3. Measuring "Dissonance" (The Argument)

Sometimes, different parts of your DNA tell different stories. Maybe the first half of a gene says "We are related to birds," while the second half says "No, we are related to lizards." This is called Dissonance.

  • The Analogy: Imagine you are asking two friends to describe a suspect.
    • Friend A says: "He's tall, wearing a red hat."
    • Friend B says: "He's short, wearing a blue hat."
    • If you ask them separately, they are both confident (low variance). But if you put them together, they are arguing (high dissonance).
  • The authors' method calculates a "Dissonance Score." If the score is low, the data agrees. If the score is high, the data is fighting, suggesting something unusual happened, such as a gene jumping from one species to another. That is what actually happened in the real-world example they studied, a plant called Bloodroot.
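The two-friends picture can be turned into a toy score. This summary does not give the paper's formula, so the sketch below makes an assumption: dissonance is the spread of all the samples pooled together minus the average spread within each dataset's own samples. As before, points in a plane stand in for trees, and the names `spread`, `dissonance`, and `cloud` are hypothetical.

```python
import random


def spread(points):
    """Variance of a cloud: mean squared distance from its centroid."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points) / n


def dissonance(samples_by_dataset):
    """Pooled spread minus average within-dataset spread (assumed form)."""
    pooled = [p for sample in samples_by_dataset for p in sample]
    within = sum(spread(s) for s in samples_by_dataset) / len(samples_by_dataset)
    return spread(pooled) - within


rng = random.Random(7)


def cloud(cx, cy, n=1000, sd=0.5):
    """Toy posterior sample: a tight ball of points around (cx, cy)."""
    return [(rng.gauss(cx, sd), rng.gauss(cy, sd)) for _ in range(n)]


# Two confident friends who agree, and two confident friends who disagree.
agree = [cloud(0, 0), cloud(0, 0)]
conflict = [cloud(-3, 0), cloud(3, 0)]

if __name__ == "__main__":
    print("agreeing datasets:   ", round(dissonance(agree), 2))
    print("conflicting datasets:", round(dissonance(conflict), 2))
```

Each friend's own cloud is equally tight in both cases; only the pooled cloud reveals the argument, which is the point of scoring the datasets together.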

4. Why This Matters

  • Scalability: This method works even if you have thousands of species. You don't need to count every possible tree; you just need to see how much the "cloud" of likely trees shrinks.
  • Truth vs. Noise: It helps scientists decide if a specific piece of DNA is actually useful or if it's just random noise (like a book filled with gibberish).
  • Filtering: In modern "phylogenomics" (studying evolution using huge amounts of DNA), scientists often have thousands of genes. This tool helps them filter out the "boring" genes that don't tell us anything and focus only on the "loud" genes that have a clear story to tell.

The Bottom Line

This paper gives scientists a new ruler. Instead of trying to count the impossible number of tree shapes, they measure how much the data concentrates the possibilities.

  • Big, fuzzy cloud after data? = No information.
  • Tiny, sharp ball after data? = Lots of information.
  • Two clouds pointing in opposite directions? = The data is confused (Dissonance).

It's a way to turn the chaotic noise of DNA sequencing into a clear, measurable signal of evolutionary history.
