On the Coalescence Time Distribution in Multi-type Supercritical Branching Processes

Imagine a vast, ever-expanding family tree. This isn't just a human family, but a population of organisms, cells, or even ideas, where every individual has a chance to have children, and those children have their own children, and so on.

This paper is about tracing the family tree backwards to find out: When did a specific group of people last share a common ancestor?

Here is the breakdown of the paper's story, using simple analogies.

1. The Setting: A Growing Family Tree

The authors are studying a "Supercritical Branching Process."

The Analogy: Imagine a family where, on average, every person has more than one child (say, 1.5 children). Because births outpace deaths, the family grows explosively fast.
The Twist: Sometimes, a family line dies out completely (extinction). But if the family is "supercritical," there is a positive chance it will survive and grow huge.
The Complexity: In this paper, the family isn't just one big group. There are different "types" of people (like different species of birds, or different types of cells), and each type has its own rules for how many babies they have.

2. The Problem: Finding the "Grandparent"

The researchers ask a specific question:

"If I pick k random individuals from a very large generation (Generation T), how far back in time do I have to go to find their Most Recent Common Ancestor (MRCA)?"

The Metaphor: Imagine you are at a massive reunion of 1 million people (Generation T). You pick 5 random people. You want to know: "Who is the most recent person in the family tree who is an ancestor of all five?"
The Challenge: If you try to simulate this by running the family tree forward from the beginning, the numbers get so huge (billions of people) that your computer crashes. It's like trying to count every grain of sand on a beach to find out where two specific grains came from.

3. The Solution: A Time-Travel Shortcut

The authors developed a mathematical "time machine" to solve this without simulating the whole explosion of the population.

A. The "Limiting Distribution" (The Shape of the Future)

They realized that even though the population size changes wildly, if you look at the shape of the family growth over a long time, it settles into a predictable pattern.

The Analogy: Think of a balloon being inflated. The size changes every second, but the way it stretches follows a specific rule. If you know that rule, you don't need to measure the balloon at every second; you can predict its shape later.
The Result: They created a formula that uses this "shape" (called the limiting distribution) to calculate the probability of when the ancestors met.

B. The "Harmonic Mean" (The Average of the Small)

To make the math work, they looked at "harmonic moments."

The Analogy: Imagine you have a bag of numbers. The "average" (arithmetic mean) is easily skewed by one huge number (like a billionaire in a room of poor people). The "harmonic mean" is more sensitive to the small numbers.
Why it matters: In a growing family tree, the "small" families (the ones that almost died out) actually hold the key to understanding how the whole tree connects. The authors found that by looking at these "small" averages, they could put upper and lower bounds on the answer. It's like saying, "The ancestor definitely lived between 10 and 20 generations ago," even if you can't pinpoint the exact year.
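To make the difference between the two averages concrete, here is a small sketch. The family sizes are made-up numbers chosen for illustration, not data from the paper.

```python
from statistics import fmean, harmonic_mean

# Hypothetical family sizes: three small families and one enormous one.
family_sizes = [2, 3, 4, 1_000_000]

arithmetic = fmean(family_sizes)          # dominated by the huge family
harmonic = harmonic_mean(family_sizes)    # dominated by the small families

print(f"arithmetic mean: {arithmetic:,.2f}")
print(f"harmonic mean:   {harmonic:.3f}")
```

The arithmetic mean lands in the hundreds of thousands, while the harmonic mean stays below 4, because it averages the reciprocals 1/2, 1/3, 1/4 and 1/1,000,000, and the last of these is almost zero.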

C. The "Harris-Sevastyanov Transformation" (The Magic Filter)

This is the paper's most clever trick.

The Problem: Calculating the "small family" averages is still hard because the original family tree is messy and can die out.
The Solution: They apply a mathematical filter (a multi-type version of a transformation named after Harris and Sevastyanov) that turns the messy, dying family tree into a perfect, immortal family tree where extinction is impossible.
The Analogy: Imagine you are trying to study the weather in a stormy, chaotic ocean. It's hard to measure. So, you use a special lens that turns the stormy ocean into a calm, flat lake. You do your measurements on the calm lake, and then use math to translate those results back to the stormy ocean.
The Benefit: This "calm lake" (the transformed process) is much easier to calculate. It allows them to estimate the answer using only the first generation of data, rather than simulating thousands of generations.

4. The Results: Fast and Accurate

The authors tested their math with computer simulations.

The Finding: Their new method is much faster than the old way of simulating the whole family tree.
The "Supercritical" Bonus: The more explosive the family growth (the more "supercritical" it is), the better their method works.
Real-world Example: In a scenario where the population grows so fast that a direct simulation would need about 5 gigabytes of memory and hours of computing time, their method produced the answer in seconds using a tiny amount of memory.

Summary

This paper is about finding the common ancestor of a group in a rapidly growing population.

Instead of trying to count every single person in a massive, exploding family tree (which quickly becomes impractical), the authors found a way to:

Look at the long-term shape of the growth.
Use a mathematical filter to turn the messy, dying tree into a simple, immortal one.
Calculate the answer using simple averages of that simple tree.

It's like figuring out who the great-grandparents of a million people were, not by interviewing a million people, but by looking at a single, cleverly designed blueprint of the family's growth.

Here is a detailed technical summary of the paper "On the Coalescence Time Distribution in Multi-type Supercritical Branching Processes" by Krasnowska, Jenkins, and Johansen.

1. Problem Statement

The paper addresses the challenge of characterizing the genealogy of a multi-type supercritical Galton–Watson branching process. Specifically, it seeks to determine the distribution of the generation $t$ of the Most Recent Common Ancestor (MRCA) for a sample of $k$ individuals drawn uniformly at random from generation $T$, in the limit as $T \to \infty$.

While the genealogy of single-type branching processes and of critical multi-type processes has been studied, the supercritical multi-type case (where the population grows exponentially with probability $1-q$, with $q$ denoting the extinction probability) presents unique difficulties:

The possibility of extinction (where the lineage dies out).
The potential for a countably infinite number of types.
The computational intractability of simulating genealogies for large $T$ due to the exponential growth of the population size.

The goal is to derive an explicit formula for the coalescence time distribution and provide computable bounds that do not require simulating the entire forward process.
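As a hedged illustration of what "supercritical with possible extinction" means in a finite-type setting, the sketch below takes a hypothetical two-type process with independent Poisson offspring counts. The mean matrix `M` is an invented example, not one from the paper. Supercriticality is checked via the largest eigenvalue of `M`, and the extinction probabilities are found as the smallest fixed point of the offspring generating function by iterating from zero.

```python
import numpy as np

# Hypothetical mean matrix: M[i, j] = mean number of type-j children of a type-i parent.
M = np.array([[1.2, 0.3],
              [0.4, 0.9]])

# Largest eigenvalue modulus; a value above 1 means the process is supercritical.
growth_rate = max(abs(np.linalg.eigvals(M)))

def offspring_pgf(s):
    """Joint pgf for independent Poisson(M[i, j]) offspring counts, one entry per parent type."""
    return np.exp(M @ (s - 1.0))

# Extinction probabilities by ancestor type: iterate q <- f(q) starting from 0.
q = np.zeros(2)
for _ in range(10_000):
    q_next = offspring_pgf(q)
    converged = np.max(np.abs(q_next - q)) < 1e-13
    q = q_next
    if converged:
        break

print("growth rate:", growth_rate)
print("extinction probabilities by ancestor type:", q)
```

In this toy example the extinction probability depends on the type of the founding individual, which is one reason the multi-type case needs more care than the single-type one.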

2. Methodology

The authors employ a combination of probabilistic limit theorems, generating function analysis, and transformation techniques.

A. Limiting Behavior and Decomposition

The core of the analysis relies on the branching property. The population at generation $T$ is decomposed into independent families originating from individuals alive at an earlier generation $t$ .

They utilize the limiting random variable $W^{(i)}$ , which represents the normalized population size of a lineage starting with a type- $i$ ancestor (scaled by $R^{-T}$ , where $R$ is the radius of convergence of the mean matrix).
By applying the Continuous Mapping Theorem and a modified Conditional Dominated Convergence Theorem, they express the probability of coalescence in terms of the limiting variables $W^{(i)}$ rather than the finite population sizes.
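The decomposition into independent families can be sketched as a brute-force baseline. The code below is an illustrative assumption throughout (two types, Poisson offspring, an invented mean matrix `M`, and helper names such as `family_sizes_at_T`); it is the kind of direct forward simulation the paper aims to avoid, and it is only feasible here because $T$ is small.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
M = np.array([[1.2, 0.3], [0.4, 0.9]])  # hypothetical Poisson offspring means

def step(counts):
    """Advance one generation: counts[i] parents of type i -> child counts by type."""
    # A sum of n independent Poisson(mu) variables is Poisson(n * mu),
    # so only the type counts need to be tracked.
    return rng.poisson(counts @ M)

def family_sizes_at_T(t, T, start_type=0):
    """Sizes at generation T of the families founded by each individual alive at generation t."""
    counts = np.eye(2, dtype=np.int64)[start_type]
    for _ in range(t):
        counts = step(counts)
    sizes = []
    for i, n_i in enumerate(counts):
        for _ in range(n_i):
            fam = np.eye(2, dtype=np.int64)[i]   # one founder of type i
            for _ in range(T - t):
                fam = step(fam)
            sizes.append(int(fam.sum()))
    return sizes

def same_family_prob(sizes, k):
    """P(k individuals sampled without replacement at T share their generation-t ancestor)."""
    return sum(comb(n, k) for n in sizes) / comb(sum(sizes), k)

k, t, T = 2, 3, 10
estimates = []
for _ in range(200):
    sizes = family_sizes_at_T(t, T)
    if sum(sizes) >= k:                          # condition on at least k individuals at T
        estimates.append(same_family_prob(sizes, k))

print("estimated P(MRCA at generation >= t):", np.mean(estimates))
```

Because family sizes grow geometrically in $T - t$, the memory and time needed by this approach blow up quickly, which is what motivates replacing the finite family sizes with the limiting variables $W^{(i)}$.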

B. The Harris–Sevastyanov Transformation

A major methodological innovation is the application of a multi-type generalization of the Harris–Sevastyanov transformation.

Problem: Calculating harmonic moments of the total population size $|Z_t|$ conditioned on non-extinction is difficult for large $t$ .
Solution: They transform the original supercritical process $(Z_t)$ (which can go extinct) into a new process $(Y_t)$ that cannot go extinct (extinction probability is 0).
This transformation allows them to bound the harmonic moments of the original process $|Z_t|$ using the moments of the transformed process $|Y_1|$ at the first generation, which are much easier to compute.
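For intuition, the classical single-type version of the transformation replaces the offspring generating function $f$ by $g(s) = \frac{f(q + (1-q)s) - q}{1-q}$, where $q$ is the extinction probability. The sketch below checks its basic properties numerically for a Poisson offspring law with a hypothetical mean `m = 1.5`; the multi-type generalization used in the paper is not reproduced here.

```python
import math

m = 1.5  # hypothetical Poisson offspring mean (supercritical because m > 1)

def f(s):
    """Offspring pgf of the original process: Poisson(m)."""
    return math.exp(m * (s - 1.0))

# Extinction probability: smallest fixed point of f on [0, 1], by iteration from 0.
q = 0.0
for _ in range(1000):
    q = f(q)

def g(s):
    """Offspring pgf after the single-type Harris-Sevastyanov transformation."""
    return (f(q + (1.0 - q) * s) - q) / (1.0 - q)

h = 1e-6
mean_transformed = (g(1.0) - g(1.0 - h)) / h   # finite-difference estimate of g'(1)

print("extinction probability q:", q)
print("P(no children) after transformation, g(0):", g(0.0))
print("offspring mean after transformation:", mean_transformed)
```

Here `g(0)` is zero up to rounding, so every individual in the transformed process has at least one child, while the offspring mean stays at `m`.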

C. Harmonic Moments and Bounds

The paper derives upper and lower bounds for the coalescence probability using harmonic moments, $E\left[|Z_t|^{-r} \mid |Z_\infty| > 0\right]$.

Theorem 4 provides bounds in terms of these harmonic moments.
Theorem 5 links these moments to the expectations of the transformed process $Y$ , establishing an exponential rate of convergence for the bounds.
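A harmonic moment can be checked by hand in a toy single-type model that cannot go extinct. Assume, purely for illustration, a shifted-geometric offspring law $P(k) = p(1-p)^{k-1}$ for $k \ge 1$. Composing its generating function $t$ times shows that $Z_t$ is again geometric with parameter $p^t$, so $E[1/Z_t]$ has a closed form that a Monte Carlo estimate can be compared against. None of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5  # toy offspring law: P(k children) = p * (1 - p)**(k - 1), k >= 1, mean 1/p

def harmonic_moment_mc(t, n_samples=200_000):
    """Monte Carlo estimate of E[1 / Z_t]; here Z_t is geometric with parameter p**t."""
    z = rng.geometric(p**t, size=n_samples)
    return float(np.mean(1.0 / z))

def harmonic_moment_exact(t):
    """E[1 / Z] for Z geometric on {1, 2, ...} with success probability pi = p**t."""
    pi = p**t
    return pi * np.log(1.0 / pi) / (1.0 - pi)

for t in range(1, 7):
    print(t, harmonic_moment_mc(t), harmonic_moment_exact(t))
```

In this toy case the harmonic moment shrinks roughly geometrically with $t$, which is the kind of decay the bounds exploit.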

3. Key Contributions

1. Generalized Coalescence Formula (Theorem 3)

The authors derive a formula for the limiting probability that $k$ sampled individuals coalesce before generation $t$ :
$\lim_{T \to \infty} P(X_{T,k} < t \mid |Z_T| \ge k) = 1 - E\left[ \frac{\sum_{i \in S} \sum_{s=1}^{Z_{t}^{i}} (W_{t,s}^{(i)})^k}{(\sum_{i \in S} \sum_{s=1}^{Z_{t}^{i}} W_{t,s}^{(i)})^k} \;\Bigg|\; |Z_\infty| > 0 \right]$

Significance: This extends previous results (e.g., [21]) to processes that can go extinct and allows for countably infinite types, provided certain moment conditions are met.
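The expectation in this formula can be estimated by Monte Carlo once $Z_t$ and the limiting variables can be sampled. The sketch below does this for a single-type toy model (an assumption made for illustration, not the paper's setting): with shifted-geometric offspring $P(k) = p(1-p)^{k-1}$, $k \ge 1$, the process never dies out, $Z_t$ is geometric with parameter $p^t$, and each normalized family size has an Exponential(1) limit.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.5  # toy shifted-geometric offspring law, mean 1/p

def same_ancestor_prob(t, k, n_rep=20_000):
    """Monte Carlo estimate of E[ sum_s W_s^k / (sum_s W_s)^k ] in the toy model."""
    total = 0.0
    for _ in range(n_rep):
        z_t = rng.geometric(p**t) if t > 0 else 1   # number of individuals at generation t
        w = rng.exponential(1.0, size=z_t)          # one limiting variable per individual
        total += np.sum(w**k) / np.sum(w)**k
    return total / n_rep

def coalescence_cdf(t, k, n_rep=20_000):
    """Estimate of the limiting probability that the MRCA of k individuals lies before generation t."""
    return 1.0 - same_ancestor_prob(t, k, n_rep)

for t in range(0, 6):
    print(t, coalescence_cdf(t, k=2))
```

The cost depends only on $t$ and the number of replicates, not on the sampling generation $T$, which has already been sent to infinity.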

2. Effective Bounds via Transformation (Theorems 4 & 5)

They provide explicit upper and lower bounds for the coalescence probability that depend on:

The harmonic moments of the population size.
The Harris–Sevastyanov transformed process at generation 1.
These bounds converge exponentially as $t$ increases, making them practical for approximating coalescence times in regimes where direct simulation is impossible.

3. Numerical Approximation Algorithm

The paper presents a computational framework (Algorithm 1) to approximate the distribution of the limiting variable $W^{(j)}$ :

It uses the characteristic function of $W^{(j)}$ derived from the generating function of the offspring distribution.
It employs Taylor series expansions for moments and Inverse Discrete Fourier Transform (IDFT) to recover the probability density function.
This allows for Monte Carlo sampling of the limiting variable, enabling the numerical estimation of the coalescence probability without simulating the full tree.
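The general idea of going from a generating function to a density via the characteristic function can be sketched in a simplified form. The code below is not the paper's Algorithm 1: it swaps the Taylor-series step for plain functional iteration of the generating function and the IDFT for a direct Riemann sum of the inversion integral, and it uses a single-type toy offspring law (shifted geometric, an assumption) whose limiting density is known to be $e^{-x}$, so the output can be checked.

```python
import numpy as np

def offspring_pgf(s, p):
    """pgf of the toy shifted-geometric offspring law on {1, 2, ...}."""
    return p * s / (1.0 - (1.0 - p) * s)

def char_fn_W(u, p=0.5, n_iter=20):
    """Approximate E[exp(i u W)] as the n_iter-fold composed pgf evaluated at exp(i u / m^n_iter)."""
    m = 1.0 / p                          # offspring mean
    s = np.exp(1j * u / m**n_iter)
    for _ in range(n_iter):
        s = offspring_pgf(s, p)
    return s

def density_from_char_fn(x, u_max=200.0, du=0.01, p=0.5):
    """Invert the characteristic function: f(x) = (1 / 2 pi) * integral of exp(-i u x) phi(u) du."""
    u = np.arange(-u_max, u_max + du / 2, du)
    phi = char_fn_W(u, p)
    integrand = np.exp(-1j * np.outer(x, u)) * phi
    return integrand.sum(axis=1).real * du / (2.0 * np.pi)

x = np.array([0.5, 1.0, 2.0])
approx = density_from_char_fn(x)
print("approximate density:", approx)
print("exp(-x) reference:  ", np.exp(-x))
```

The truncation point `u_max` and step `du` control the accuracy; a characteristic function that decays slowly, as this one does, needs a wide frequency range before the recovered density settles.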

4. Results

Theoretical Validity: The derived formula (Theorem 3) is proven to hold under Assumptions 1–5 (irreducibility, supercriticality, finite moments, and non-trivial extinction probabilities).
Convergence Rates: The bounds show that the probability of coalescence converges to 1 at an exponential rate determined by the spectral properties of the mean matrix and the Harris–Sevastyanov transformation.
Numerical Performance:
- The authors tested their method on systems with $d=2$ types (Poisson offspring distributions).
- Accuracy: The approximated density of $W$ and the estimated coalescence probabilities closely matched direct simulations for "slightly supercritical" systems.
- Efficiency: For "significantly supercritical" systems (where population sizes reach $10^9$), direct simulation became infeasible (requiring ~5 GB of RAM and taking hours), whereas the proposed method using Theorem 3 completed in seconds (approx. 0.2 s to 16 s depending on the step).
- Tightness: The bounds become tighter as the leading eigenvalue (growth rate) of the system increases.

5. Significance

Bridging Theory and Practice: The paper provides a rigorous theoretical framework for multi-type coalescent processes that was previously lacking, particularly for supercritical systems with extinction risks.
Computational Feasibility: By avoiding the need to simulate the forward genealogy of exponentially growing populations, the method enables the study of genealogies in biological and epidemiological models where populations grow rapidly (e.g., early stages of an epidemic or cell proliferation).
Methodological Advancement: The generalization of the Harris–Sevastyanov transformation to multi-type settings and its application to bounding harmonic moments offers a new tool for analyzing the tail behavior of branching processes.
Extensibility: The framework is applicable to both finite and countably infinite type spaces, making it relevant for complex biological models with diverse cell or species types.

In summary, the paper characterizes the limiting MRCA distribution in growing multi-type populations by combining probabilistic limit theorems with the Harris–Sevastyanov transformation, which yields bounds and a numerical scheme that remain computable where direct simulation does not.