Sequential learning theory for Markov genealogy processes

Imagine you are trying to reconstruct a family tree for a group of people, but you only have a few scattered photos and birth dates. You want to figure out when the common ancestor lived, how fast the family grew, or how long the whole tree is. This is the job of phylodynamics: using genetic sequences to reconstruct the history of evolving populations, such as viruses or species.

A common question scientists ask is: "If I add more people (taxa) to my analysis, will my answer get better?"

Intuitively, you'd think "Yes, more data is always better." But in practice, adding more sequences can sometimes destabilize the analysis, increase uncertainty, or even lead to wrong answers.

This paper introduces a new way to think about this problem. It uses a clever mathematical framework to explain when adding data helps, why it sometimes hurts, and what the fundamental limits of our knowledge are.

Here is the breakdown using simple analogies:

1. The "Random Lineup" Analogy (The Filtration)

Imagine you have a bag of 100 puzzle pieces (genetic sequences), but you only look at them one by one in a random order.

The Setup: The author imagines shuffling the order in which we see the data. As we pull out piece #1, then #2, then #3, we build a "filtration," a growing picture of the tree.
The Goal: They want to see how our confidence changes with every new piece we add.

2. The Three Forces of Change

When you add a new piece to your puzzle, your uncertainty (variance) changes due to three competing forces. The author breaks this down like a financial transaction:

Learning (The Good): You get new information. You learn something you didn't know before. This usually lowers uncertainty.
Mismatch (The Bad/Confusing): The "target" you are aiming for keeps moving!
- Example: If you are trying to guess the age of the entire family tree, but you only have 3 people, your guess is based on a tiny branch. When you add a 4th person, the "true age of the whole tree" might shift because the new person connects to a much older part of the family. Your target moved, so your previous guess was suddenly "wrong" in a new way. This creates a "mismatch" that can temporarily increase confusion.
Covariance (The Relationship): How the "Learning" and the "Mismatch" interact. Sometimes they cancel each other out; sometimes they make things worse.

The Takeaway: Adding data on average reduces uncertainty about the final truth, but the path there is bumpy. Sometimes the "Mismatch" force is so strong that adding a new sequence makes you less sure about your current guess, even though you are getting closer to the ultimate truth.

3. The "Oracle" vs. The "Analyst" (The Core Discovery)

This is the most fascinating part of the paper. The author introduces two characters:

The Analyst (You): You see the data as it comes in. You don't know the full family tree yet. You are guessing.
The Oracle (The All-Knowing Being): This being sees the entire hidden family tree (the "latent genealogy") instantly. The Oracle knows exactly when the puzzle is "solved" for a specific question.

The "Absorbing" Moment:
Imagine you are trying to find the oldest ancestor.

The Analyst: Keeps adding people, wondering, "Is this the oldest one yet? Maybe the next person will be older?"
The Oracle: Knows immediately when the group of people you have already seen is enough to lock in the answer. Once you have seen a "straddling" group (people from two different branches that force the root to be a certain age), the answer is absorbed. It cannot change no matter who you add later.

The Gap:
The Oracle knows the answer is "locked in" the moment it happens. The Analyst does not know this. The Analyst keeps worrying that the answer might change.

The Result: Even after you have seen all the available data, the Analyst is still slightly more uncertain than the Oracle.
Why? Because the Analyst doesn't know if they have already reached the "absorbing" state. The Analyst is haunted by the "what ifs" of the hidden tree structure.

4. The Fundamental Limit

The paper concludes with a sobering but important truth: There is a hard limit to what sequence data alone can tell us.

Even if you have perfect data and perfect math, you can never be as certain as the Oracle who sees the hidden structure. There is a permanent "gap" in knowledge caused by the fact that we are looking at a shadow (the sequences) rather than the object itself (the full genealogy).

Summary in a Nutshell

Adding data usually helps, but not always immediately. Sometimes it confuses you because the target you are aiming at shifts.
We can categorize problems based on how the "target" behaves (does it settle down quickly, or does it keep moving?).
There is a "blind spot." We can never fully know if we have found the final answer just by looking at the data we have, because we don't know the hidden structure of the family tree.
The "Oracle Gap" is real. There is a fundamental limit to how certain we can ever be, simply because we are missing the "behind-the-scenes" view of the full history.

This paper gives scientists a new map to understand why their computer models sometimes get "jumpy" when adding new data, and it sets realistic expectations about the limits of what we can learn from genetic sequences alone.

Here is a detailed technical summary of the paper "Sequential learning theory for Markov genealogy processes" by David J. Pascall.

1. Problem Statement

In phylodynamic inference, a fundamental practical question is whether adding more taxa (sequences) to an analysis always improves estimation. Empirical observations suggest this is not always true; additional sequences can sometimes increase posterior uncertainty, degrade mixing, or amplify model misspecification. However, a theoretical foundation explaining when and why adding taxa helps or hurts has been lacking.

The paper addresses the gap in understanding how the uncertainty of specific estimands (quantities of interest) evolves as data is sequentially added, particularly when the estimand itself changes with the sample size (e.g., the time to the most recent common ancestor, tMRCA, of the included tips).

2. Methodology

The author introduces a filtration-based framework to model sequential learning in Markov Genealogy Processes (MGPs).

Mathematical Setup:
- The framework operates on a probability space supporting a random element $\Delta = (\Theta, G, \Lambda)$, where $\Theta$ represents model parameters, $G$ is the latent genealogy (tree), and $\Lambda$ is a random permutation of the observed tips.
- Filtration Construction: By applying a uniform random permutation $\Lambda$ to the observed tips, the author constructs a natural ordering of data $D_n = (Y_1, \ldots, Y_n)$. This generates a filtration $\mathcal{F}_n = \sigma(D_n)$, allowing the application of standard sequential Bayesian analysis results.
- Estimands: The paper distinguishes between:
  - Permutation-invariant estimands: Fixed targets (e.g., substitution rates) that do not change as $n$ increases.
  - Permutation-variant (Sequential) estimands: Targets that depend on the specific set of observed tips (e.g., tMRCA of the current sample). These have a limit estimand ($K_\infty$), representing the value if the entire latent genealogy were observed.
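
The filtration construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sequential_datasets`, the tip labels, and the use of a seeded pseudo-random shuffle in place of $\Lambda$ are all hypothetical and do not come from the paper.

```python
import random

def sequential_datasets(tips, seed=0):
    """Yield D_1, D_2, ..., D_N: nested prefixes of the tips in a uniformly random order."""
    rng = random.Random(seed)
    order = list(tips)
    rng.shuffle(order)            # stands in for the random permutation Lambda
    for n in range(1, len(order) + 1):
        yield tuple(order[:n])    # D_n = (Y_1, ..., Y_n)

tips = ["A", "B", "C", "D"]       # hypothetical tip labels
datasets = list(sequential_datasets(tips, seed=42))
for d in datasets:
    print(d)
```

Each dataset extends the previous one by exactly one tip, which is the property that lets the information available to the analyst grow step by step.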

Variance Decomposition:
The core analytical tool is the decomposition of the expected posterior variance of a sequential estimand $K_n$ relative to its limit $K_\infty$. The change in variance upon adding a taxon is broken down into three components:
1. Learning: The change in uncertainty about the current sequential target.
2. Mismatch: The change in uncertainty regarding the distance between the current target and the limit target ($K_\infty - K_n$).
3. Covariance: The change in the covariance between the current target and the mismatch.
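
One way to see where three terms of this kind can come from, assuming the mismatch is written $M_n = K_\infty - K_n$ (the symbol $M_n$ is introduced here for illustration and may not match the paper's notation), is the ordinary variance-of-a-sum identity applied conditionally on $\mathcal{F}_n$:

$$
\operatorname{Var}(K_\infty \mid \mathcal{F}_n) = \operatorname{Var}(K_n \mid \mathcal{F}_n) + \operatorname{Var}(M_n \mid \mathcal{F}_n) + 2\operatorname{Cov}(K_n, M_n \mid \mathcal{F}_n)
$$

Differencing this identity between steps $n$ and $n+1$ and taking expectations gives one change per term. That is presumably how the Learning, Mismatch, and Covariance components arise, although the paper's exact statement may group or scale the terms differently.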

3. Key Contributions

A. Taxonomy of Learning Classes

The paper classifies sequential estimands based on the pathwise behavior of the "mismatch" ($|K_\infty - K_n|$) as $n$ increases. The classes include:

Fixed: Constant estimands (e.g., clock rates).
Absorbing Monotonic: The mismatch decreases monotonically, and equality with the limit is reached with positive probability before the maximum sample size (e.g., tMRCA).
Absorbing Non-monotonic: Equality is reached, but the path is not monotonic.
Terminal Monotonic/Non-monotonic: The limit is never reached within the sample size.
Mixed/Non-absorbing: Complex behaviors where equality is reached and then lost, or never reached.
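
To make the "Absorbing Monotonic" class concrete, here is a minimal sketch on a hand-built five-tip tree. Everything in it is an assumption made for illustration: the tree shape, the node heights, the helper names, and the choice to treat the tMRCA of all five tips as the limit $K_\infty$.

```python
import random

# Hypothetical toy genealogy: child -> parent, and node heights (time before present).
PARENT = {"A": "X", "B": "X", "C": "Z", "D": "Z", "Z": "Y",
          "E": "Y", "X": "R", "Y": "R", "R": None}
HEIGHT = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": 0.0,
          "Z": 2.0, "X": 4.0, "Y": 6.0, "R": 10.0}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def tmrca(tips):
    """Height of the most recent common ancestor of the given tips."""
    paths = [path_to_root(t) for t in tips]
    shared = set(paths[0]).intersection(*paths[1:])
    # The first shared node on the way up from the first tip is the MRCA.
    mrca = next(node for node in paths[0] if node in shared)
    return HEIGHT[mrca]

def trajectory(order):
    """Return (K_n values, mismatch values, absorption step tau) for a tip ordering."""
    k_inf = tmrca(order)                      # assumed limit: tMRCA of all tips
    ks = [tmrca(order[:n]) for n in range(1, len(order) + 1)]
    mismatch = [k_inf - k for k in ks]
    tau = next(n for n, m in enumerate(mismatch, start=1) if m == 0)
    return ks, mismatch, tau

rng = random.Random(1)
order = ["A", "B", "C", "D", "E"]
rng.shuffle(order)
ks, mismatch, tau = trajectory(order)
print(order, ks, mismatch, tau)
```

In this toy the mismatch never increases, and it reaches zero as soon as the prefix contains tips from both sides of the root. Note that the code can compute `tau` only because it has the full tree in hand, which is the Oracle's position rather than the Analyst's.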

B. The Oracle vs. Analyst Gap

A central theoretical contribution is the introduction of an Oracle who knows the "absorption status" (i.e., whether the current sequential estimand has already equaled the limit estimand, denoted by the random time $\tau$).

Analyst: Only observes the data filtration $\mathcal{F}_n$.
Oracle: Observes the expanded filtration $\mathcal{F}'_n = \sigma(D_n, \tau)$.

The paper proves that the Oracle obtains event-wise learning guarantees (variance reduction in expectation for every specific realization of the absorption event) that the Analyst cannot access.

C. Irreducibility of Uncertainty

The paper establishes a fundamental limit on inference from sequence data alone. Even after observing all sampled tips, the Analyst's posterior variance strictly exceeds the Oracle's expected posterior variance. This gap is irreducible under stochastic sampling processes because the Analyst lacks knowledge of the latent genealogy's structure (specifically, whether the current sample has "straddled" the root or reached the limit).
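
A rough numerical intuition for such a gap, offered as a sketch rather than as the paper's argument: if the Analyst's posterior is a mixture over an unobserved absorption status, the Law of Total Variance splits its variance into a within-status part (what the Oracle faces on average) and a between-status part. The function name and all numbers below are hypothetical.

```python
from fractions import Fraction

def analyst_and_oracle_variance(p, mean_abs, var_abs, mean_not, var_not):
    """Variance of a two-component mixture versus the expected within-component variance.

    p is the (hypothetical) posterior probability that absorption has occurred.
    """
    # Oracle: knows the status, so faces only the within-status variance on average.
    oracle = p * var_abs + (1 - p) * var_not
    # Analyst: additionally carries the spread between the two conditional means.
    overall_mean = p * mean_abs + (1 - p) * mean_not
    between = (p * (mean_abs - overall_mean) ** 2
               + (1 - p) * (mean_not - overall_mean) ** 2)
    analyst = oracle + between
    return analyst, oracle

analyst, oracle = analyst_and_oracle_variance(
    p=Fraction(3, 4),
    mean_abs=Fraction(10), var_abs=Fraction(1),
    mean_not=Fraction(14), var_not=Fraction(4),
)
gap = analyst - oracle
print(analyst, oracle, gap)
```

In this toy the gap equals $p(1-p)$ times the squared difference of the two conditional means, so it vanishes only when the status is certain or the two means coincide. Whether the paper's conditions take exactly this form is not confirmed by the summary.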

4. Key Results

Proposition 1 (Standard Learning): For permutation-invariant estimands, adding taxa cannot increase the expected posterior variance (a consequence of the Law of Total Variance).
Theorem 1 (Sequential Learning Decomposition): For general sequential estimands, the expected variance reduction is the sum of the Learning, Mismatch, and Covariance terms. While the sum is non-negative (guaranteed by Proposition 1), individual terms can be negative, explaining why adding taxa might locally increase uncertainty.
Corollary 1 (Oracle Guarantees): The Oracle achieves guaranteed variance reduction event-wise because they know the absorption status, allowing them to ignore the "mismatch" and "covariance" terms that burden the Analyst.
Theorem 3 (Irreducibility of the Oracle Gap): Even with full observation of sampled tips, the Analyst's uncertainty remains strictly higher than the Oracle's if there is a non-zero probability of absorption. This gap arises from the Analyst's inability to determine if the current sample is "complete" regarding the limit target (e.g., if the sample has straddled the root of the tree).
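
The flavor of Proposition 1 can be checked exactly in a generic conjugate model. The sketch below uses a coin-flip (Beta-Bernoulli) model with a uniform prior as a stand-in for a fixed estimand; it is not a phylodynamic model, and the function name is hypothetical.

```python
from fractions import Fraction

def expected_posterior_variance(n):
    """E[Var(theta | D_n)] for theta ~ Beta(1, 1) and n Bernoulli(theta) observations."""
    total = Fraction(0)
    for k in range(n + 1):
        a, b = 1 + k, 1 + n - k                       # posterior is Beta(a, b)
        var = Fraction(a * b, (a + b) ** 2 * (a + b + 1))
        total += Fraction(1, n + 1) * var             # uniform prior: P(k successes) = 1/(n+1)
    return total

values = [expected_posterior_variance(n) for n in range(6)]
for n, v in enumerate(values):
    print(n, v)
```

The expected posterior variance shrinks with every added observation here, even though for a particular data sequence the realized posterior variance can go up. The guarantee is about the average, which is the sense in which the results above speak of expected variance.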

5. Significance and Implications

Theoretical Explanation of Instability: The framework explains why adding taxa can sometimes degrade inference. It is not a failure of the model, but a consequence of the "mismatch" term in the variance decomposition. When the target shifts (e.g., adding a tip changes the tMRCA of the sample), the uncertainty about the distance to the true limit can temporarily outweigh the learning gained about the new target.
Fundamental Limits of Phylodynamics: The paper demonstrates that sequence data alone has a hard limit on what can be learned about the latent genealogy. Without knowledge of the latent process structure (specifically the absorption status), there is an irreducible uncertainty gap between what an analyst can infer and what is theoretically knowable.
Practical Guidance: By classifying estimands into learning classes, the theory helps practitioners anticipate the behavior of specific metrics (like tMRCA vs. tree length) as datasets grow. It suggests that for "absorbing" estimands, early additions may yield high variance due to mismatch, while later additions (post-absorption) behave like standard learning.

In summary, Pascall provides a rigorous mathematical framework that moves beyond the assumption that "more data is always better," revealing the complex interplay between data accumulation, target shifting, and latent structural uncertainty in phylodynamic inference.