Methodological pitfalls in plant pangenome gene family identification may lead to biased evolutionary inferences

This study demonstrates that relying solely on sequence similarity for pangenome gene family identification introduces significant biases in evolutionary inferences, and recommends a two-step strategy combining graph-based orthology with sequence refinement to ensure accurate results.

Original authors: Liu, S., Zhang, W., Yu, P.

Published 2026-05-18

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to organize a massive library containing books from 401 different branches of the same family (in this case, 401 different rice plants). Your goal is to group these books into "families" based on how similar their stories are. Some books are the exact same story found in every branch (the "core" stories), some are shared by a few branches (the "shell"), and some are unique to just one branch (the "cloud").
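
To make the three categories concrete, here is a minimal Python sketch that labels a gene family by how many accessions carry it. It follows the strict definitions given above (every accession, exactly one accession, or anything in between). The function name, the toy family names, and the four-accession data set are illustrative assumptions; the thresholds the authors actually applied are not stated in this summary, and pangenome studies often use looser percentage cut-offs.

```python
def classify_families(family_to_accessions, n_accessions):
    """Label each gene family as core, shell, or cloud.

    family_to_accessions maps a family ID to the accessions that carry it.
    Assumes every family is present in at least one accession.
    """
    labels = {}
    for family, accessions in family_to_accessions.items():
        count = len(set(accessions))  # count each accession once
        if count == n_accessions:
            labels[family] = "core"   # present in every accession
        elif count == 1:
            labels[family] = "cloud"  # unique to a single accession
        else:
            labels[family] = "shell"  # shared by some, but not all
    return labels


# Toy data: 4 hypothetical accessions instead of the 401 in the study.
toy_families = {
    "famA": ["acc1", "acc2", "acc3", "acc4"],
    "famB": ["acc1", "acc3"],
    "famC": ["acc2"],
}
labels = classify_families(toy_families, n_accessions=4)
print(labels)
```

Because the label depends only on which genes end up in the same family, any change in how families are drawn can move a family from one category to another.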

This paper is a warning about how scientists have been sorting these book families.

The Problem: Sorting by Cover Art Only
Many researchers have been using a quick, automated method to sort these books. They look at the "cover art" (the sequence of letters in the DNA) and group books together if the covers look similar enough. They do this without checking the actual plot or the history of the book.

The authors of this paper say this is like trying to sort a library by only glancing at the spine color. You might accidentally put a mystery novel next to a romance novel just because they both have red spines, even though the stories inside are completely different. In scientific terms, this "cover-only" method (using tools like cd-hit or MMseqs2 alone) tends to mash distinct groups of genes together, creating fewer, messier groups than there actually are.

The Experiment: A Test with Five Famous Families
To prove this, the researchers took five very important groups of rice genes (think of them as five famous book series: bHLH, MYB, NAC, WRKY, and MADS-box) and tried to sort them using the following strategies:

  1. The Quick Sort: Just using the "cover art" similarity tools.
  2. The History Check: Using a more advanced tool (OrthoFinder) that looks at the family tree and how the books are arranged on the shelf (phylogeny and synteny).
  3. The Hybrid Approach: Using the "History Check" first to get the big picture, then using the "Quick Sort" to fine-tune the details.

The Results: Chaos vs. Clarity
The results showed that the "Quick Sort" methods made a lot of mistakes.

  • The Mix-Up: Depending on the gene family, the quick methods disagreed with the accurate "History Check" method anywhere from 14% to 57% of the time. For the MYB family, more than half the books were sorted into the wrong pile!
  • The Size Issue: The quick methods often confused genes just because they were different lengths, like grouping a short story with a novel just because the cover looked similar.
  • The Impact: Because the piles were wrong, the scientists' classification of which genes were "core" (found everywhere) and which were "cloud" (rare) changed drastically.
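
One way to put a number on this kind of disagreement is sketched below in Python. It counts the share of genes whose group under one method does not have exactly the same members as their group under a reference method. This metric is an assumption chosen for illustration; the summary does not say how the authors computed their 14% to 57% figures, and the gene and cluster names are made up.

```python
def disagreement_rate(reference, candidate):
    """Fraction of genes whose candidate group differs from their reference group.

    Both arguments map gene ID -> group label and must cover the same genes.
    """
    if reference.keys() != candidate.keys():
        raise ValueError("both clusterings must cover the same genes")
    if not reference:
        raise ValueError("no genes to compare")

    def members_by_gene(assignment):
        # Build label -> member set, then look up each gene's full group.
        groups = {}
        for gene, label in assignment.items():
            groups.setdefault(label, set()).add(gene)
        return {gene: frozenset(groups[label]) for gene, label in assignment.items()}

    ref_groups = members_by_gene(reference)
    cand_groups = members_by_gene(candidate)
    mismatched = sum(1 for gene in reference if ref_groups[gene] != cand_groups[gene])
    return mismatched / len(reference)


# Hypothetical example: the candidate method merges two reference groups.
reference = {"g1": "OG1", "g2": "OG1", "g3": "OG2", "g4": "OG3"}
candidate = {"g1": "c1", "g2": "c1", "g3": "c2", "g4": "c2"}
rate = disagreement_rate(reference, candidate)
print(f"{rate:.0%} of genes are grouped differently")
```

In the toy data, g3 and g4 sit in separate reference groups but are merged by the candidate method, so both count as mismatched.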

The Evolutionary Consequence: Reading the Wrong Plot
The most critical finding was about how these genes evolved. Scientists often measure "selective pressure" (how strongly natural selection is acting on a gene) with the Ka/Ks ratio, which compares the rate of mutations that change the encoded protein (Ka) with the rate of mutations that leave it unchanged (Ks).

  • When the "Quick Sort" was used, the results were all over the place, like a noisy radio with static.
  • When the "History Check" (graph-based) method was used, the results were clear and consistent.
  • Interestingly, for the rare "cloud" genes, the method didn't matter as much, but for the common "core" genes, using the wrong sorting method led to completely wrong conclusions about how they evolved.
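
For readers unfamiliar with Ka/Ks, the Python sketch below shows how the ratio is conventionally read: values well below 1 suggest purifying selection, values near 1 suggest neutral evolution, and values well above 1 suggest positive selection. The function name and the `tolerance` band around 1 are illustrative assumptions, not values from the paper. Estimating Ka and Ks themselves requires codon-level alignments and a substitution model, which is not shown here.

```python
def interpret_ka_ks(ka, ks, tolerance=0.1):
    """Return a coarse reading of the Ka/Ks ratio, or None if it is undefined.

    tolerance is a hypothetical band around 1.0 that is treated as "neutral".
    """
    if ks <= 0:
        return None  # ratio is undefined without synonymous substitutions
    ratio = ka / ks
    if ratio < 1 - tolerance:
        return "purifying"  # protein-changing mutations are being removed
    if ratio > 1 + tolerance:
        return "positive"   # protein-changing mutations are being favored
    return "neutral"


print(interpret_ka_ks(0.02, 0.2))
print(interpret_ka_ks(0.3, 0.2))
```

Since Ka and Ks are computed between members of the same gene family, a family that wrongly mixes unrelated genes can distort the ratio, which is consistent with the noisy results described above.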

The Solution: A Two-Step Strategy
The paper concludes that you cannot rely on simple similarity alone. Instead, the authors recommend a two-step strategy:

  1. First, build a family tree: Use a method that understands evolutionary history to draw the main lines between gene groups.
  2. Second, polish the details: Use the fast similarity tools to clean up the edges of those groups.
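
The two steps above can be sketched in Python as follows. This is one plausible reading, not the authors' pipeline: it assumes step one has already produced an orthogroup label per gene (for example from a tool such as OrthoFinder) and that a similarity tool has produced a cluster label per gene. The refinement then only ever splits an orthogroup and never merges genes across orthogroups. All names and data are hypothetical.

```python
from collections import defaultdict


def refine_orthogroups(orthogroup_of, similarity_cluster_of):
    """Split each orthogroup by similarity cluster.

    orthogroup_of: gene ID -> orthogroup label (step 1, history-aware).
    similarity_cluster_of: gene ID -> similarity cluster label (step 2).
    Returns a dict mapping (orthogroup, similarity cluster) -> set of genes.
    """
    families = defaultdict(set)
    for gene, orthogroup in orthogroup_of.items():
        # Genes missing from the similarity clustering get the label None.
        cluster = similarity_cluster_of.get(gene)
        families[(orthogroup, cluster)].add(gene)
    return dict(families)


# Hypothetical input: similarity alone would merge g3 and g4.
orthogroup_of = {"g1": "OG1", "g2": "OG1", "g3": "OG1", "g4": "OG2"}
similarity_cluster_of = {"g1": "c1", "g2": "c1", "g3": "c2", "g4": "c2"}
families = refine_orthogroups(orthogroup_of, similarity_cluster_of)
for key, genes in sorted(families.items()):
    print(key, sorted(genes))
```

Keying each family on the pair of labels means the history-aware grouping sets the outer boundaries, while the similarity clusters only subdivide within them.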

In short: If you want to understand the evolutionary story of rice genes, you can't just look at the cover. You need to read the family history first, or you'll end up telling a story that never happened.
