The limits of Bayesian estimates of divergence times in… — Plain-Language Explanation

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to figure out exactly when a family reunion happened, but you only have a blurry photo of the family tree and a few scattered notes about how fast the family members have been aging. This is essentially what scientists do when they try to date the history of viruses and bacteria using their genetic code.

This paper is like a detective story about how accurate our "time machines" really are when we try to trace the history of fast-evolving microbes like the flu or Hepatitis B.

Here is the breakdown of their findings using simple analogies:

1. The Problem: The "Speed vs. Time" Confusion

Imagine you are driving a car. You know the distance you traveled (the genetic changes in the virus), but you don't know your speed (how fast the virus mutates) or exactly how long you've been driving.

The Catch: If you drive 100 miles, you could have done it in 2 hours at 50 mph, or 10 hours at 10 mph. You can't tell the difference just by looking at the odometer.
In Science: This is called the "identifiability problem." To solve it, scientists usually have to make a guess (a "prior") about the speed or the time. Because of this guess, there is always a limit to how precise our time estimates can be, even if we had infinite data.

2. The Old Theory vs. The New Discovery

The Old Theory (Ultrametric Trees):
Earlier theory, worked out for trees in which every sample comes from the present day, showed that the older a branch was, the fuzzier its date would be. It was like looking at a tree: the roots are so far back in time that it's hard to be sure exactly when they started growing. The further back you go, the bigger the "maybe" zone becomes.

The New Discovery (Measurably Evolving Populations):
The authors looked at viruses and bacteria that are sampled over time (like taking a photo of the flu in January, then March, then June). They found that the old rule does not carry over to these cases.

The New Rule: It doesn't matter how old the branch is; it matters how close it is to a sample we actually have.
The Analogy: Imagine a family tree where you have dated photos of only some relatives.
- If you are trying to guess the birth date of your grandparent, and you have a dated photo of your parent (who is only one generation away), you can guess quite accurately.
- If you are trying to guess the birth date of your great-grandparent, but the closest dated photo you have is still of your parent (now two generations away), your guess will be much fuzzier, even if the great-grandparent isn't that "old" in the grand scheme of things.
The Takeaway: The uncertainty depends on the distance to the nearest known sample, not the absolute age of the event.

3. The "Infinite Data" Dream

The paper asks: "What if we had infinite data? Would our guesses become perfect?"

The Answer: Not quite. More data makes the guesses sharper, but even with infinite data there is a "floor" to how precise we can be.
The Analogy: Think of trying to hear a whisper in a noisy room. If you add more microphones (more data), the whisper gets clearer. But if the room is too big (too many unknown variables), there's still a tiny bit of static you can't eliminate.
The Reality Check: The authors ran simulations showing that to get close to that floor, you would need a dataset so huge it's practically impossible for real-world outbreaks. For example, for a virus like the flu, you'd need a dataset with nearly 100,000 unique genetic patterns. Real outbreaks usually have far fewer.

4. Why Some Viruses Are Easier to Date Than Others

The paper compared the Flu (fast mutator) and Hepatitis B (slow mutator).

The Flu: It changes so fast that even a few months of data gives us a lot of "clues" (mutations). It's like a fast-forwarding video; you can see the action clearly. The uncertainty is small (maybe a few weeks).
Hepatitis B: It changes very slowly. Even with thousands of years of data, it's like watching a video in extreme slow motion where nothing seems to happen. The uncertainty is huge (hundreds of years).
The Lesson: Just having more samples doesn't always help if the virus isn't changing fast enough to give you new clues.

5. The "Calibration" is Key

The most important tool in the scientist's kit is a calibration point—a sample with a known date.

The Analogy: If you are trying to guess the time of a crime, having a witness who saw the suspect at 2:00 PM is great. But if that witness is 10 miles away from the crime scene, your guess about the exact time of the crime gets worse the further away the witness is.
The Finding: To get a precise date for a specific event in a virus's history, you need a sample that is genetically close and time-close to that event. If the closest sample is far away in the family tree, your date estimate will be shaky, no matter how much data you have.

Summary for the General Public

This paper tells us that while we are getting better at tracking viruses, there are hard limits to how precise our time estimates can be.

Distance matters more than age: It's not about how old the virus is, but how close we are to a sample with a known date.
Data has a ceiling: We can't just "collect more data" to get perfect answers. Real-world outbreaks often don't have enough genetic changes to give us perfect precision.
Fast is better: Viruses that mutate quickly (like Flu) are easier to date precisely than slow ones (like Hepatitis B).

The Bottom Line: When scientists say a virus emerged "6 months ago with a margin of error of 2 weeks," that margin of error isn't a sign of sloppy work; it reflects a fundamental statistical limit set by how much information the virus's genetic code actually gave us. This paper helps us understand exactly what that limit is.

1. Problem Statement

The paper addresses a fundamental limitation in Bayesian phylogenetic inference regarding the estimation of divergence times (evolutionary timescales) using molecular data.

The Identifiability Problem: In standard phylogenetic analyses of extant species (ultrametric trees), evolutionary times and molecular rates are confounded; only their product (branch length) is statistically identifiable. Consequently, priors on time and rate are required to break this confounding, establishing a theoretical lower bound on uncertainty even with infinite data.
The Gap: "Infinite-sites theory" (Yang & Rannala, 2006; Rannala & Yang, 2007) describes how uncertainty behaves in extant taxa, where it scales with node age, but its behavior in measurably evolving populations (heterochronous data, such as viruses and bacteria sampled over time) had not been characterized. In these populations, sampling times provide calibration, which in principle makes times and rates identifiable. However, real-world datasets (e.g., viral outbreaks) are often small, and it is unclear how uncertainty scales with data size or whether it converges to zero.
Key Question: How does the uncertainty in divergence time estimates scale with data size (number of sites/loci) in heterochronous datasets, and what are the theoretical limits of precision?
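
The confounding of rate and time described above can be illustrated with a minimal sketch. It assumes the Jukes-Cantor (JC69) substitution model and made-up rate and time values, none of which come from the paper; the point is only that the probability of observing a difference at a site depends on the product of rate and time, so two very different scenarios are indistinguishable from sequence data alone.

```python
import math


def expected_distance(rate, time):
    """Expected substitutions per site along a branch: rate x time."""
    return rate * time


def jc69_p_diff(rate, time):
    """Probability that a site differs across a branch under JC69 (an assumed model)."""
    d = expected_distance(rate, time)
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# Hypothetical values: a fast clock over a short time vs. a slow clock over a long time.
fast_short = jc69_p_diff(rate=2e-3, time=1.0)
slow_long = jc69_p_diff(rate=2e-4, time=10.0)

# Both scenarios give the same probability of a difference (up to rounding),
# so the data cannot separate them without extra information.
print(fast_short, slow_long)
```

In heterochronous data, known sampling dates supply the extra information that a prior would otherwise have to provide.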

2. Methodology

The authors employed a combination of extensive simulation experiments and empirical data analysis using a Bayesian phylogenetic framework (BEAST2 v2.7.7).

Data Sources:
- Empirical: 2009 H1N1 influenza virus data (North America, sampled June, August, December) and Hepatitis B Virus (HBV) data (100 genomes, including ancient samples up to 5,000 years old).
- Simulations: Based on the empirical H1N1 trees, the authors generated 900 (strict clock) and 180 (relaxed clock) simulated datasets.
Simulation Design:
- Tree Scaling: To vary information content without changing sequence length, they scaled the total tree length (sum of branch lengths) to three levels: $4 \times 10^{-4}$, $5 \times 10^{-3}$, and $2$ substitutions/site. This resulted in approximately 80, 800, and 95,000 unique site patterns, respectively.
- Models: Analyses were run under both Strict Molecular Clock and Relaxed Molecular Clock (lognormal distribution) models.
- Topology: To isolate the effect of sequence information on time estimation, the tree topology was fixed to the "true" generating tree for simulations, avoiding the confounding factor of topological uncertainty in low-information datasets.
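
The tree-scaling step can be sketched as a uniform rescaling of branch lengths so that their sum hits a target total. This is an illustration only, not the authors' code: the branch names and starting lengths are hypothetical, and only the target of $5 \times 10^{-3}$ substitutions/site is taken from the summary above.

```python
def scale_tree_length(branch_lengths, target_total):
    """Rescale all branch lengths by one factor so they sum to target_total.

    branch_lengths: dict mapping a branch label to its length (substitutions/site).
    Relative branch lengths are preserved; only the overall scale changes.
    """
    total = sum(branch_lengths.values())
    if total <= 0:
        raise ValueError("total tree length must be positive")
    factor = target_total / total
    return {name: length * factor for name, length in branch_lengths.items()}


# Hypothetical four-branch tree; lengths are placeholders, not values from the paper.
example_tree = {"b1": 0.010, "b2": 0.020, "b3": 0.005, "b4": 0.015}
scaled = scale_tree_length(example_tree, target_total=5e-3)
print(sum(scaled.values()))
```

Because every branch is multiplied by the same factor, the tree shape stays fixed while the expected number of substitutions, and therefore the number of informative site patterns, changes.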
Analytical Approach:
- The authors calculated the 95% Highest Posterior Density (HPD) interval width for internal node ages.
- They tested two hypotheses for the predictor variable: (1) Absolute node age vs. (2) Distance to the closest tip with a known sampling time (tip-calibration).
- They performed linear regressions of HPD width against these predictors to assess "infinite-sites behavior" (linearity, slope, y-intercept, and RMSE).
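
For readers unfamiliar with the HPD interval, here is a minimal sketch of one common way to estimate it from posterior samples: take the narrowest window that contains 95% of the sorted samples. This is a generic estimator and an assumption on our part, not necessarily the exact procedure the authors or BEAST2 use; the sample values are made up.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Return (lower, upper) of the narrowest interval holding `mass` of the samples."""
    x = sorted(samples)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    # Number of samples the interval must contain.
    k = min(max(math.ceil(mass * n), 2), n)
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: x[i + k - 1] - x[i])
    return x[best], x[best + k - 1]


# Made-up posterior samples of a node age: 100 evenly spaced values plus one outlier.
node_age_samples = list(range(100)) + [1000]
lower, upper = hpd_interval(node_age_samples)
hpd_width = upper - lower
print(lower, upper, hpd_width)
```

The width `upper - lower` is the uncertainty measure that is then regressed against node age or distance to the closest tip.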

3. Key Contributions & Findings

A. Redefining the Source of Uncertainty in Heterochronous Data

Contrary to the infinite-sites theory for extant taxa (where uncertainty scales with absolute node age), the authors found that in non-ultrametric (heterochronous) trees, uncertainty scales positively with the distance to the closest tip-calibration.

Nodes closer to a sampled tip (known age) have lower uncertainty, regardless of their absolute age relative to the root.
Nodes further from any sampled tip exhibit higher uncertainty, even if they are "young" in absolute terms.
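
The difference between the two predictors can be made concrete with a small sketch. It assumes a rooted time tree stored as a child map with node times in decimal years, and it measures "distance to the closest tip-calibration" as the time gap to the nearest descendant tip; that definition, the tree, and the dates are all illustrative assumptions, not taken from the paper.

```python
def descendant_tips(node, children):
    """Collect all tips (nodes without children) below `node`."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    tips = []
    for kid in kids:
        tips.extend(descendant_tips(kid, children))
    return tips


def distance_to_closest_tip(node, children, times):
    """Time gap between an internal node and its nearest dated descendant tip."""
    return min(times[tip] - times[node] for tip in descendant_tips(node, children))


# Hypothetical tree: the root has an early-sampled tip t1 and an internal child A.
children = {"root": ["t1", "A"], "A": ["t2", "t3"]}
times = {"root": 2009.0, "t1": 2009.1, "A": 2009.3, "t2": 2009.9, "t3": 2009.95}

root_distance = distance_to_closest_tip("root", children, times)
a_distance = distance_to_closest_tip("A", children, times)
print(root_distance, a_distance)
```

In this toy tree the root is the older node, yet it sits closer to a dated tip than node A does, so under the finding above it would be expected to have the narrower interval.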

B. Theoretical Limits of Precision (Infinite-Sites Behavior)

Convergence: As the number of unique site patterns increases (approaching "infinite" information), the relationship between uncertainty and tip-calibration distance becomes increasingly linear with a flatter slope and lower overall uncertainty.
The Y-Intercept: Even with infinite data, the regression y-intercept (uncertainty for a node with zero distance to a tip) is not zero. This indicates a theoretical minimum uncertainty exists, even for nodes immediately adjacent to a calibration point.
- Example: In simulations, this minimum uncertainty was estimated at 1–2 weeks for fast-evolving viruses.
Data Requirements: Achieving true infinite-sites behavior (slope approaching zero) requires massive amounts of data. In their simulations, ~95,000 unique site patterns were needed to approach this limit.
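
"Unique site patterns" can be counted by treating each alignment column as a pattern and counting the distinct ones. The sketch below shows that idea on a toy alignment; how the authors' software treats gaps or ambiguity codes is not stated here, so this simple version treats every character literally.

```python
def count_unique_site_patterns(alignment):
    """Count distinct columns in an alignment given as equal-length sequence strings."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # zip(*alignment) yields one tuple per column, in taxon order.
    return len(set(zip(*alignment)))


# Toy alignment of three sequences and six sites (made-up data).
toy_alignment = [
    "ACGTAC",
    "ACGTAC",
    "ACGAAC",
]
n_patterns = count_unique_site_patterns(toy_alignment)
print(n_patterns)  # 4 distinct columns: AAA, CCC, GGG, TTA
```

A long alignment of nearly identical genomes can therefore carry very few patterns, which is why sequence length alone is a poor guide to information content.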

C. Impact of Model Complexity and Taxa Count

Relaxed Clocks: Relaxed molecular clock models (estimating rate variation) require significantly more data to achieve the same precision as strict clocks due to the increased number of parameters. With low information content, the prior on rate variation dominates, masking differences between clock models.
Taxa Density: Adding more taxa (e.g., December vs. June in the H1N1 dataset) increases the parameter space (more branch lengths/nodes). Consequently, datasets with more taxa require more site patterns to achieve the same level of precision as smaller datasets.

D. Empirical Validation (Influenza vs. HBV)

Influenza (H1N1): Despite having fewer unique site patterns (~1,500) than HBV, the influenza dataset exhibited behavior closer to the infinite-sites limit (slope = 0.345, min uncertainty ~2 weeks) due to its high evolutionary rate and short timescale.
HBV: Despite having more site patterns (~2,100), HBV showed higher uncertainty (slope = 0.669, min uncertainty ~175 years) due to its slower evolutionary rate and the inclusion of ancient samples spanning thousands of years.

4. Significance and Implications

Outbreak Investigation: The study establishes that for most real-world viral outbreak datasets (which typically have limited sequence length and few unique site patterns), estimates of evolutionary timescales will not reach the "infinite-sites" limit, that is, the minimum uncertainty attainable with unlimited sequence data.
Practical Guidance: The uncertainty in divergence time estimates is not solely a function of the total number of sites but is heavily dependent on:
1. The information content (unique site patterns).
2. The phylogenetic distance of the node of interest to the nearest sampled tip.
3. The complexity of the model (strict vs. relaxed clock).
Ancient DNA: The inclusion of ancient DNA samples acts as crucial tip-calibrations, reducing the "distance" to deep nodes and thereby lowering uncertainty. However, the utility depends on the specific microbe's evolutionary rate and the age of the samples.
Methodological Framework: The authors propose the "infinite-sites plot" (regression of HPD width vs. distance to tip-calibration) as a diagnostic tool. Researchers can use the slope, RMSE, and y-intercept of this plot to determine the theoretical minimum uncertainty achievable for a specific dataset before investing in further sequencing.
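
The quantities read off an infinite-sites plot can be sketched with an ordinary least-squares fit of HPD width against distance to the closest tip-calibration. The function below is a plain-Python illustration under that assumption, and the input numbers are invented for the example; they are not results from the paper.

```python
import math


def infinite_sites_fit(distances, hpd_widths):
    """Least-squares line through (distance, HPD width); returns slope, intercept, RMSE."""
    n = len(distances)
    if n != len(hpd_widths) or n < 2:
        raise ValueError("need two equal-length inputs with at least two points")
    mean_x = sum(distances) / n
    mean_y = sum(hpd_widths) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    if sxx == 0:
        raise ValueError("distances must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, hpd_widths))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(distances, hpd_widths)]
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return slope, intercept, rmse


# Invented data lying exactly on the line width = 0.1 + 0.5 * distance.
example_distances = [0.0, 0.2, 0.4, 0.6, 0.8]
example_widths = [0.1 + 0.5 * d for d in example_distances]
slope, intercept, rmse = infinite_sites_fit(example_distances, example_widths)
print(slope, intercept, rmse)
```

Read this way, the intercept estimates the uncertainty for a node sitting right at a calibration, the slope shows how quickly uncertainty grows with distance from one, and a small RMSE indicates the points lie close to a straight line.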

Conclusion

The paper demonstrates that while measurably evolving populations theoretically allow for the identification of times and rates, practical constraints (small dataset sizes, model complexity) impose a non-zero lower bound on uncertainty. The precision of divergence time estimates is fundamentally limited by the density of sampling in time and the amount of unique evolutionary information (site patterns) available, rather than just the absolute age of the nodes. This framework provides a critical tool for setting realistic expectations in microbial phylodynamics and outbreak investigations.

The limits of Bayesian estimates of divergence times in measurably evolving populations