Disentangling the Impacts of Incomplete Lineage Sorting and Gene Tree Estimation Error on Species Tree Inference

This study uses systematic simulations to show that gene tree estimation error (GTEE) typically harms species tree accuracy more than incomplete lineage sorting (ILS), even at comparable overall discordance levels. The explanation offered is that GTEE generates uniform, high-entropy noise, whereas ILS produces a structured, constrained skew in quartet distributions.

Original authors: Tahmid, N., Rhythm, S. I., Bayzid, M. S.

Published 2026-02-21

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to reconstruct the family history of a group of animals (like birds) by looking at their DNA. You have thousands of different "chapters" from their genetic books (genes), and you want to piece them together to draw the correct family tree.

However, there's a problem: these chapters often tell different stories. Sometimes they disagree with each other, and sometimes they even disagree with the "true" family tree you are trying to find. This disagreement is called Gene Tree Discordance.

This paper investigates why these chapters disagree and which reason is more dangerous for getting the right answer. The authors found two main culprits:

  1. The "Real" Confusion (Incomplete Lineage Sorting - ILS): This is a biological reality. Imagine a family where three cousins are born in quick succession. If you ask them, "Who is your closest relative?" they might genuinely be unsure because their ancestors hadn't fully separated yet. The DNA is actually mixed up due to how evolution works. This is ILS.
  2. The "Bad Translation" (Gene Tree Estimation Error - GTEE): This is a technical mistake. Imagine trying to read a very blurry, smudged, or tiny piece of paper. You might misread a word or guess the wrong sentence structure. The DNA isn't actually mixed up; you just didn't have enough information to read it correctly. This is GTEE.
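For readers who want to see the "real confusion" in numbers, here is a minimal sketch based on a standard result from the multispecies coalescent model, not on code from the paper. For four species whose tree has an internal branch of length t (in coalescent units), a gene tree matches the species tree with probability 1 - (2/3)e^(-t), and each of the two alternative groupings appears with probability (1/3)e^(-t). The function name and the example branch lengths are illustrative choices.

```python
import math


def ils_quartet_probabilities(t):
    """Gene tree topology probabilities for four taxa under ILS alone.

    t is the internal branch length of the species tree in coalescent units.
    Returns (matching, alternative_1, alternative_2).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    alternative = math.exp(-t) / 3.0
    return (1.0 - 2.0 * alternative, alternative, alternative)


# Shorter branches (faster splits) mean more ILS and a weaker majority.
for t in (0.1, 0.5, 2.0):
    matching, alt1, alt2 = ils_quartet_probabilities(t)
    print(f"t={t}: matching={matching:.3f}, alternatives={alt1:.3f}, {alt2:.3f}")
```

Under this model the matching grouping is never less likely than either alternative, and the two alternatives are always tied. That symmetry is the kind of structure that species tree methods can exploit.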

The Big Experiment: The "Twin" Test

The researchers wanted to know: Which is worse for building the family tree? Is it the real biological confusion (ILS) or the mistake caused by bad data (GTEE)?

To find out, they created a controlled experiment using computer simulations. They built two sets of "fake" family trees:

  • Set A: Had the same amount of disagreement, but it was caused only by the real biological confusion (ILS).
  • Set B: Had the exact same amount of disagreement, but it was caused only by bad data/blurry reading (GTEE).

They then asked their best computer programs (the "detectives") to solve the mystery using both sets.

The Shocking Discovery

The "Bad Translation" (GTEE) is typically more damaging than the "Real Confusion" (ILS), even when both produce the same overall amount of disagreement.

Here is the analogy:

  • ILS (Real Confusion): Imagine a group of people trying to solve a puzzle, but they are genuinely confused about a few pieces. If you give them more people (more genes) to help, they eventually figure it out. The more data you add, the clearer the picture becomes.
  • GTEE (Bad Translation): Imagine the same group, but now everyone is wearing foggy glasses. Each person misreads the pieces in their own random way because the data is too short or blurry. If you give them more people (more genes), you mostly get more random guesses. The right answer emerges far more slowly, because each new helper adds as much noise as signal.

The study showed that even when the overall level of disagreement was the same, the family trees built from the "bad translation" data were typically less accurate than those built from the "real confusion" data.

The "Quartet" Clue

How did they tell the difference? They looked at the "voting patterns" of the DNA.

  • Under ILS: The votes are messy, but they are structured. The "correct" answer is still the most popular vote, just slightly less popular than it should be. It's like a noisy room where the right answer is still being shouted the loudest.
  • Under GTEE: The votes are flat and random. The "correct" answer loses its voice because the votes are spread almost evenly across all the possible answers. It's like a room where everyone is whispering something different, and the right answer is drowned out.
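The difference between "structured" and "flat" voting can be put into numbers. The sketch below is an illustration only: the vote counts are invented for this example and are not taken from the paper, and the function names are hypothetical. It computes the Shannon entropy of the votes for the three possible groupings of four species (higher means flatter), along with the lead of the winning grouping over the runner-up.

```python
import math


def shannon_entropy(counts):
    """Entropy in bits of a vote distribution; the maximum for 3 options is log2(3)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def top_margin(counts):
    """Lead of the most popular option over the runner-up, as a fraction of all votes."""
    ordered = sorted(counts, reverse=True)
    return (ordered[0] - ordered[1]) / sum(counts)


# Invented counts: votes for the correct grouping and the two alternatives.
structured_counts = [600, 200, 200]  # ILS-like: clear winner, tied alternatives
flat_counts = [340, 330, 330]        # GTEE-like: almost no winner

for label, counts in (("structured", structured_counts), ("flat", flat_counts)):
    print(f"{label}: entropy={shannon_entropy(counts):.3f} bits, "
          f"margin={top_margin(counts):.3f}")
```

In this toy case the flat counts sit close to the maximum entropy of log2(3), about 1.585 bits, and the winner's lead nearly vanishes, which is what "the right answer is drowned out" looks like in a vote table.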

The Bird Case Study

To check that this holds beyond computer simulations, they looked at a real dataset of 48 bird species (a group famous for its rapid, confusing evolutionary history).

They found that:

  • Short DNA sequences (Exons): These were like the "blurry glasses." They caused a lot of "bad translation" errors. When they used only these short sequences, the family tree was messy and less accurate.
  • Long DNA sequences (Introns): These were like "clear glasses." They had less error. When they used these, the family tree became much more accurate.

The Lesson: If you have a dataset with thousands of genes, but they are all short and "blurry," adding more of them may not help much. It is better to filter out the low-quality data and focus on the long, clear sequences.
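As a rough sketch of what such filtering could look like, the snippet below keeps only genes whose alignments reach a minimum length. Everything here is an assumption made for illustration: the gene names, the lengths, and the 500-site cutoff are hypothetical, and alignment length is only a crude stand-in for how reliably a gene tree can be estimated. The paper may use different criteria.

```python
def filter_genes_by_length(alignment_lengths, min_sites):
    """Keep genes whose alignment has at least min_sites columns.

    alignment_lengths maps a gene name to its alignment length.
    """
    return {
        gene: length
        for gene, length in alignment_lengths.items()
        if length >= min_sites
    }


# Hypothetical genes and lengths, for illustration only.
example_lengths = {
    "exon_001": 240,
    "exon_002": 310,
    "intron_001": 1850,
    "intron_002": 2400,
}

kept = filter_genes_by_length(example_lengths, min_sites=500)
print(sorted(kept))
```

In practice a cutoff like this would need to be chosen and checked against the dataset at hand rather than taken from this example.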

Summary in One Sentence

While nature sometimes genuinely mixes up family histories (which more data can usually resolve), poor data quality creates a flatter, more random kind of confusion that misleads our computer methods, and adding more poor data does not reliably fix it. To get the right family tree, we must distinguish between "real biological noise" and "technical errors."
