Statistical and structural bias in birth-death models

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to figure out the history of a family. You have a family tree, and your job is to guess two things:

How fast new family members were born (Speciation, $\lambda$ ).
How fast family members died out (Extinction, $\mu$ ).

For a long time, scientists have used a mathematical tool called a "birth-death model" to solve this mystery. But, as Jeremy Beaulieu and Brian O'Meara show in this preprint, the standard version of that tool has a few hidden glitches. It's like trying to weigh a feather using a scale meant for elephants: the results are often wrong, especially when the "feather" (the family tree) is small.

Here is the breakdown of their discovery, explained simply.

1. The "Cherry Tree" Problem (The Missing Piece)

Imagine you find a tiny family tree with only two living members (two siblings joined by a single shared ancestor). The authors call these "Cherry Trees."

The Glitch: If you try to use the standard math formulas on a Cherry Tree, the math breaks down. It's like trying to solve for $x$ and $y$ when you only have one equation. You simply don't have enough information to know if the family grew fast and died fast, or grew slow and died slow.
The Mistake: Because the math breaks, scientists often just throw these tiny trees away. They say, "We'll only look at families with 3 or more people."
The Consequence: By throwing away the tiny families, the scientists accidentally created a structural bias. It's like a census taker who only counts people in big houses and ignores everyone in small apartments. Suddenly, the average house looks huge, and the population looks different than it really is. This makes the scientists think that new species are appearing faster than they actually are, especially in young groups.

The Fix: The authors realized that if you must throw away the tiny trees, you have to change the math to account for the fact that you threw them away. It's like adjusting your census results to say, "We know we missed the small apartments, so let's add a correction factor."

2. The "Under-estimator" Bias (The Shy Calculator)

Even when the math works for big trees, the standard calculator has a personality flaw: it is shy.

The Glitch: The standard formula consistently guesses that the birth rate is lower than it really is.
The Analogy: Imagine you are guessing how many jellybeans are in a jar. The standard method always guesses 10% fewer than the actual number. If there are 100 jellybeans, it says 90. If there are 1,000, it says 900. It's a systematic error. In the real case the shortfall is not a fixed percentage: it is largest for small trees and shrinks as the tree grows.
The Cause: This happens because of how the math handles the "end" of the tree. The authors did some heavy algebra (and used a computer to find patterns) to prove exactly how much it underestimates.
The Fix: They found a simple "magic multiplier." If you take the standard guess and multiply it by $(n-1)/(n-2)$, where $n$ is the number of leaves on the tree, the shyness is corrected.

3. The "Extinction" Trap (The Harder Puzzle)

Fixing the birth rate was relatively easy. Fixing the death rate (extinction) was much harder.

The Glitch: The standard method doesn't just underestimate the death rate; it gets confused by the relationship between birth and death.
The Analogy: Imagine trying to guess how many people are leaving a party (extinction) while people are also arriving (birth). If you only look at the people currently at the party, it's hard to tell if the room is empty because people aren't arriving, or because they are leaving very fast.
The Fix: The authors found that to fix the death rate guess, you need to know two things:
1. How many people are in the room (Sample size).
2. The ratio of people leaving to people arriving (Extinction fraction).
  Their new formula combines these two factors to give a much more accurate picture.

4. The "Net Result" (The Final Score)

Scientists often care about the "Net Diversification" rate. This is simply: Birth Rate minus Death Rate. It tells you how fast a group is actually growing.

The Problem: Because the birth rate guess was too low and the death rate guess was slightly off, the final "Net" score was also wrong. It was like subtracting a slightly too-high number from a slightly too-low number, resulting in a very inaccurate final score.
The Good News: When they applied their new corrections, the "Net" score got much better. However, they found that Turnover (Birth + Death) is a more stable and reliable number than "Net" growth. The errors in birth and death tend to cancel out when you add the two rates, but they compound when you subtract one from the other.
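
A small worked example shows why adding is safer than subtracting here. As an illustrative assumption (the symbols $a$ and $b$ are made up for this sketch and do not come from the paper), suppose the birth-rate guess is too low by $a$ and the death-rate guess is too high by $b$, with both $a$ and $b$ positive:

  $\hat{\lambda} = \lambda - a, \qquad \hat{\mu} = \mu + b$
  $\hat{\lambda} + \hat{\mu} = (\lambda + \mu) + (b - a)$
  $\hat{\lambda} - \hat{\mu} = (\lambda - \mu) - (a + b)$

In the sum, the two errors partly cancel. In the difference, they stack, which pushes the "Net" score too low.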

The Big Takeaway

This paper is a "user manual update" for evolutionary biologists.

Don't ignore the small trees: If you have a tiny family tree, don't just delete it. If you do, you must adjust your math to account for the deletion.
Apply the correction: The standard formulas are "shy." Use the new multipliers the authors provided to wake them up and get the right numbers.
Be careful with "Net" growth: If you are studying how fast a group is growing, be aware that your numbers might be underestimating the truth unless you use these new corrections.

In short, the authors didn't just find a bug; they supplied a patch, so that estimates of how fast lineages were born and died out are less distorted, especially for small trees.

1. Problem Statement

The accurate estimation of speciation ( $\lambda$ ) and extinction ( $\mu$ ) rates from phylogenetic trees is fundamental to evolutionary biology. However, the authors identify two critical sources of bias that compromise these estimates, particularly in small clades or when analyzing sub-regimes within larger trees:

Statistical Bias: Systematic deviations in the expected values of maximum likelihood estimators (MLE) from their true generating parameters. Specifically, standard estimators tend to underestimate rates, a phenomenon exacerbated by small sample sizes.
Structural Bias: Arising from how likelihood functions are conditioned and how data is filtered.
- Cherry Trees ( $n=2$ ): Many standard likelihood formulations (e.g., Stadler 2013) are undefined for two-taxon trees ("cherry trees") because the product term over speciation events becomes empty. Consequently, these trees are often implicitly excluded from analyses.
- Conditioning: Excluding $n=2$ trees introduces a secondary layer of conditioning (observing $n > 2$ ) that is rarely accounted for in the likelihood calculation, leading to inflated rate estimates for young clades.
- Identifiability: The authors demonstrate that cherry trees lack sufficient information to jointly estimate $\lambda$ and $\mu$ , making their exclusion necessary but requiring proper statistical correction.

2. Methodology

The authors employed a combination of analytical derivation, simulation, and machine learning techniques to address these biases:

Analytical Derivation:
- Re-derived the expected bias for the Yule process ( $\mu=0$ ) estimator.
- Derived new likelihood functions conditioned on observing $n > 2$ taxa for three scenarios: the general birth-death model ( $\lambda \neq \mu$ ), the Yule model ( $\mu=0$ ), and the critical branching process ( $\lambda = \mu$ ).
- Proved analytically that cherry trees cannot identify both $\lambda$ and $\mu$ by examining the log-likelihood surface and its partial derivatives.
Simulation Studies:
- Simulated 100,000+ trees under constant-rate birth-death processes to quantify the bias in standard estimators when $n=2$ trees are excluded.
- Generated large datasets (250,000 Yule trees; 500,000 birth-death trees) using Latin hypercube sampling to cover a wide range of clade ages, speciation rates, and extinction fractions ( $\epsilon = \mu/\lambda$ ).
Symbolic Regression:
- Since analytical bias corrections for the general birth-death model are intractable due to non-linear dependencies, the authors used symbolic regression (via the R package gramEvol) to discover functional forms that minimize the bias.
- They defined a grammar allowing multiplicative corrections of the form $\hat{\theta}_{corr} = \hat{\theta} \cdot c(\cdot)$ , where $c$ is a function of sample size ( $n$ ) and estimated parameters.
- The search was validated first on Yule data (where the analytical solution is known) and then applied to general birth-death data.
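
The Latin hypercube sampling mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' simulation code; the function name `latin_hypercube` and the three parameter ranges are hypothetical and chosen only for the example.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that, for every parameter, each of
    n_samples equal-width strata of its range is used exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum of the unit interval.
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired at random across parameters.
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    # Transpose: one tuple of parameter values per simulated tree.
    return list(zip(*columns))

# Hypothetical ranges (NOT taken from the paper):
# clade age, speciation rate lambda, extinction fraction epsilon.
bounds = [(1.0, 50.0), (0.05, 1.0), (0.0, 0.9)]
design = latin_hypercube(1000, bounds, seed=42)

age, lam, eps = design[0]
mu = eps * lam  # extinction rate implied by epsilon = mu / lambda
```

Compared with independent uniform draws, this design guarantees that no part of any single parameter's range is left unsampled, which is presumably why it suits a sweep over clade ages, rates, and extinction fractions.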

3. Key Contributions and Results

A. Structural Bias and Conditioning

The "Cherry Tree" Problem: The authors confirmed that excluding $n=2$ trees without adjusting the likelihood leads to upwardly biased rate estimates, particularly in young clades.
Solution: They derived a corrected likelihood function $L(\lambda, \mu | n > 2)$ that explicitly conditions on the probability of observing more than two extant taxa. Applying this conditioning removes the upward trend in rate estimates observed in simulations.
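
To make the idea of conditioning on $n > 2$ concrete, here is a minimal sketch for the Yule special case ( $\mu=0$ ) only. It assumes the clade starts as two lineages at a known crown age, in which case the probability of ending with a cherry is $e^{-2\lambda t}$. The general birth-death expression derived in the paper is not reproduced here, and the function names are hypothetical.

```python
import math
import random

def prob_more_than_two_tips_yule(lam, crown_age):
    """P(n > 2) for a Yule clade that starts as two lineages at the crown.
    Neither lineage splits before the present with probability
    exp(-2 * lam * crown_age), which is the probability of a cherry."""
    return 1.0 - math.exp(-2.0 * lam * crown_age)

def conditioned_loglik(loglik, lam, crown_age):
    """Condition a crown-age log-likelihood on having observed n > 2 tips
    by subtracting the log-probability of that event."""
    return loglik - math.log(prob_more_than_two_tips_yule(lam, crown_age))

def simulate_yule_tip_count(lam, crown_age, rng):
    """Count tips at the present by drawing waiting times between splits."""
    n_tips, elapsed = 2, 0.0
    while True:
        # With n_tips lineages, the next split arrives at rate n_tips * lam.
        elapsed += rng.expovariate(n_tips * lam)
        if elapsed > crown_age:
            return n_tips
        n_tips += 1

# Check the closed form against simulation (illustrative values).
rng = random.Random(1)
lam, crown_age = 0.1, 5.0
counts = [simulate_yule_tip_count(lam, crown_age, rng) for _ in range(20000)]
simulated = sum(c > 2 for c in counts) / len(counts)
exact = prob_more_than_two_tips_yule(lam, crown_age)
print(simulated, exact)
```

In this sketch the subtracted term is largest when $\lambda \times$ age is small, which matches the observation that ignoring the exclusion of cherries matters most for young clades.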

B. Statistical Bias Corrections

Using symbolic regression and analytical proofs, the authors derived specific correction factors for different parameters:

Speciation Rate ( $\lambda$ ):
- Finding: The standard MLE $\hat{\lambda}$ underestimates the true rate.
- Correction: The optimal correction is identical for both Yule and general birth-death models:
  $\hat{\lambda}_{corr} = \hat{\lambda} \times \frac{n-1}{n-2}$
- This correction was analytically derived for Yule and confirmed via symbolic regression for birth-death models.
Extinction Rate ( $\mu$ ):
- Finding: The bias in $\mu$ is more complex, depending on both sample size ( $n$ ) and the estimated extinction fraction ( $\hat{\epsilon} = \hat{\mu}/\hat{\lambda}$ ).
- Correction: The best-performing correction identified by symbolic regression is:
  $\hat{\mu}_{corr} = \hat{\mu} \times \left( \frac{n}{n-1} + \hat{\epsilon} \right)$
- Simpler corrections (e.g., constant multipliers) were less accurate, highlighting the structural coupling between $\lambda$ and $\mu$ .
Derived Parameters:
- Turnover ( $\tau = \lambda + \mu$ ): Because the biases in $\lambda$ (underestimation) and $\mu$ (slight overestimation after correction) are roughly opposite, the sum (turnover) is nearly unbiased.
- Net Diversification ( $r = \lambda - \mu$ ): This parameter remains systematically underestimated. The error is dominated by the bias in $\mu$ . The authors recommend applying the same correction factor used for $\mu$ to net diversification:
  $\hat{r}_{corr} = \hat{r} \times \left( \frac{n}{n-1} + \hat{\epsilon} \right)$
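
The three multiplicative corrections above can be applied as in the following sketch. The function names are hypothetical, and the input values are made up for illustration. As in the formulas, $\hat{\epsilon}$ is computed from the uncorrected estimates.

```python
def _check_inputs(lam_hat, n):
    # The corrections divide by (n - 2) or (n - 1) and by lam_hat.
    if n <= 2:
        raise ValueError("corrections are undefined for n <= 2 (cherry trees)")
    if lam_hat <= 0:
        raise ValueError("lam_hat must be positive to form epsilon = mu / lambda")

def correct_speciation(lam_hat, n):
    """lambda_hat * (n - 1) / (n - 2)."""
    _check_inputs(lam_hat, n)
    return lam_hat * (n - 1) / (n - 2)

def extinction_factor(lam_hat, mu_hat, n):
    """n / (n - 1) + eps_hat, with eps_hat = mu_hat / lam_hat."""
    _check_inputs(lam_hat, n)
    return n / (n - 1) + mu_hat / lam_hat

def correct_extinction(lam_hat, mu_hat, n):
    """mu_hat * (n / (n - 1) + eps_hat)."""
    return mu_hat * extinction_factor(lam_hat, mu_hat, n)

def correct_net_diversification(lam_hat, mu_hat, n):
    """(lam_hat - mu_hat) * (n / (n - 1) + eps_hat)."""
    return (lam_hat - mu_hat) * extinction_factor(lam_hat, mu_hat, n)

# Illustrative inputs (not from the paper): a 10-tip clade.
lam_hat, mu_hat, n = 0.2, 0.05, 10
print(correct_speciation(lam_hat, n))
print(correct_extinction(lam_hat, mu_hat, n))
print(correct_net_diversification(lam_hat, mu_hat, n))
```

With these inputs the speciation multiplier is $9/8$ and the extinction multiplier is $10/9 + 0.25$. Note that the corrected net diversification comes from its own formula, so it is not the same as the difference between the corrected $\lambda$ and the corrected $\mu$.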

C. Identifiability Limits

The paper provides a rigorous proof that cherry trees ( $n=2$ ) cannot identify $\lambda$ and $\mu$ separately. The log-likelihood surface for a two-taxon tree has no maximum in the $\lambda$ direction (it is strictly decreasing) and is flat in the $\mu$ direction in some regions of parameter space. Excluding these trees is therefore statistically sound, provided the likelihood is conditioned on $n>2$ .
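
The "strictly decreasing" behaviour is easy to see in the Yule special case ( $\mu=0$ ), which is used here as a simplifying assumption rather than as the paper's general proof. For a cherry with crown age $t$, the only information in the tree is that neither of the two lineages split during time $t$, so the log-likelihood $\ell$ is:

  $\ell(\lambda) = \log e^{-2\lambda t} = -2\lambda t, \qquad \frac{\partial \ell}{\partial \lambda} = -2t < 0$

The slope is negative for every $\lambda > 0$, so the likelihood keeps rising as $\lambda \to 0$ and never reaches an interior maximum.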

4. Significance and Implications

Framework for Inference: The paper provides a general framework for reducing bias in diversification rate estimation, particularly for studies involving small clades, young lineages, or methods that subdivide trees (e.g., BAMM, MEDUSA, ClaDS, MiSSE).
Practical Recommendations:
- Researchers should exclude $n=2$ trees but must use the conditioned likelihood ( $n>2$ ) to avoid structural bias.
- Estimated rates should be multiplied by the derived correction factors (Table 4 in the paper) to obtain unbiased estimates.
- Net diversification is a particularly unstable metric in small samples due to the asymmetry in bias correction; Turnover is recommended as a more robust summary statistic.
Bayesian Context: The authors note that Bayesian priors do not automatically fix these likelihood-based biases. The recommended approach is to estimate parameters normally and then apply these post-hoc corrections.

In summary, Beaulieu and O'Meara clarify that the "small sample size" problem in diversification studies is twofold: a lack of data in tiny trees and a statistical artifact of how those trees are handled. By providing explicit correction formulas and conditioned likelihoods, they offer a path toward more accurate macroevolutionary inference.
