Ancestral state reconstruction with discrete characters using deep learning

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a cold case. You have a family tree of suspects (a phylogeny) and you know what the current suspects look like (their traits, like eye color or location). Your goal is to figure out what their great-great-grandparents looked like, even though those ancestors are long dead and left no photos. This is the problem of Ancestral State Reconstruction (ASR).

For decades, detectives have used a specific set of mathematical rules (called "likelihood-based methods") to solve this. These rules work great when the crime scene is simple. But if the crime scene is messy, complex, or involves rules that don't fit the standard math (like how a virus spreads through a city), those old rules break down. They get stuck because the math becomes too hard to solve.

Enter the new detective: Deep Learning.

This paper introduces a new tool called PHYDDLE. Think of PHYDDLE not as a rule-follower, but as a super-smart student who learns by watching thousands of practice cases. Instead of trying to solve a complex equation, it looks at patterns in the data and says, "I've seen this pattern before; in 90% of those cases, the ancestor was this way."

Here is a breakdown of how the authors tested this new detective, using simple analogies:

1. The Training Camp (Simulation)

Before sending PHYDDLE out to solve real crimes, the authors had to train it. They created a massive "training camp" with 500,000 fake family trees and fake evolutionary histories.

The Analogy: Imagine a video game where you play thousands of levels to learn the rules. PHYDDLE played these evolutionary levels over and over, learning to guess the past based on the present.
The Challenge: The authors had to make sure the training games were diverse enough. If they only trained PHYDDLE on small, simple trees, it would be terrible at solving cases with huge, complex trees. They had to teach it to handle trees of all shapes and sizes.

2. The Test Drive (Simple vs. Complex)

The authors put PHYDDLE to the test in two scenarios:

Scenario A: The Simple Crime (Small Trees, Simple Rules)
- The Setup: A small family tree with just a few branches and simple rules (like a coin flip determining a trait).
- The Result: PHYDDLE performed almost perfectly, matching the results of the old, trusted mathematical methods.
- The Takeaway: For simple cases, the new AI detective is just as good as the old-school math detective.

Scenario B: The Complex Crime (Big Trees, Messy Rules)
- The Setup: Huge family trees with hundreds of branches, or complex rules where traits change depending on how fast species are born or die (like the Ebola virus spreading).
- The Result: PHYDDLE still did a decent job, but it started to make more mistakes than the old math methods. As the trees got bigger, the AI got a bit "confused."
- The Takeaway: The AI is great, but it's not magic. It struggles when the family tree gets too big or the rules get too complicated, likely because it hasn't seen every possible variation of a huge tree during training.

3. Real-World Cases (The Empirical Tests)

Finally, they used PHYDDLE on two real-life mysteries:

Case 1: The Lizards of South America (Liolaemus)
- The Mystery: Did these lizards evolve in the high mountains (Andes) or the lowlands?
- The Result: PHYDDLE's guess was very similar to the traditional method. It successfully mapped out where the lizard ancestors likely lived, showing that the AI can handle real biological data.

Case 2: The 2014 Ebola Outbreak
- The Mystery: Where did the virus start, and how did it move between different districts in Sierra Leone?
- The Twist: This is a "hard" problem. The virus spreads in a way that doesn't have a simple math formula (likelihood) to solve it. Traditional methods struggle here.
- The Result: PHYDDLE was able to reconstruct the virus's journey. It correctly guessed that the outbreak likely started in the eastern region (State 0) and spread outward. This is a huge win because it solved a problem that was previously very difficult to crack with standard math.

The Verdict: What Does This Mean?

Think of Likelihood-based methods as a calculator. It's incredibly precise and accurate, but it can only solve problems where you can write down a clear equation. If the equation is too messy, the calculator gives an error.

Think of Deep Learning (PHYDDLE) as a human expert. It might not be 100% perfect on every single calculation, but it can look at a messy, complex situation and make a very good guess based on experience and pattern recognition.

The Bottom Line:
This paper shows that we can now use AI to tackle evolutionary mysteries that were previously very hard to crack because the math was intractable. While the AI isn't perfect yet (it gets a bit less accurate on very large trees), it opens the door to studying complex biological processes, like how diseases spread or how species adapt to changing environments, without getting stuck on the math.

It's like giving evolution a new pair of eyes that can see patterns in the chaos, helping us understand the history of life on Earth a little better.

1. Problem Statement

Ancestral State Reconstruction (ASR) is a fundamental task in phylogenetics used to infer the traits of extinct ancestors based on observed data at the tips of a phylogenetic tree.

The Limitation of Current Methods: Traditional ASR relies on likelihood-based methods (e.g., Maximum Likelihood or Bayesian inference). These require a tractable likelihood function. While simple models (like standard Markov models) have exact likelihoods, many biologically realistic models (e.g., complex Speciation-Extinction models, SIR epidemiological models with migration) result in intractable likelihoods.
The Gap: For models without tractable likelihoods, researchers often resort to approximations or cannot perform ASR at all.
The Challenge for Deep Learning: While deep learning offers a "likelihood-free" alternative, applying it to ASR is difficult because:
1. Tree topologies vary in size and shape, making it hard to define consistent input/output structures.
2. Internal nodes in different trees do not correspond to the same phylogenetic positions, complicating supervised learning.
3. Generating sufficiently diverse training data that covers the space of possible tree topologies and branch lengths is non-trivial.

2. Methodology

The authors modified the existing phylogenetic deep learning software PHYDDLE (Landis and Thompson, 2025) to perform ASR for discrete characters.

A. Data Representation and Encoding

Tree Encoding: Trees are converted into tensors using Compact Bijective Ladderized Vector (CBLV) or Compact Diversity-reordered Vector (CDV) encodings. These methods rotate descendants based on sample ages or branch lengths to reduce the number of patterns the network must learn.
Tip States: Encoded using modified CBLV+S or CDV+S formats with zero-padding to handle variable tree sizes up to a maximum limit.
Internal Nodes: Nodes are indexed via in-order traversal. For a tree with $N$ tips, there are $N-1$ internal nodes to estimate.
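The indexing and padding steps above can be sketched in a few lines. This is a minimal illustration, not PHYDDLE's implementation: the nested-tuple tree representation and the names `index_internal_nodes` and `pad_tip_states` are hypothetical, and the real CBLV+S and CDV+S tensors also carry branch-length information that is omitted here.

```python
# Hypothetical tree representation (an assumption for this sketch):
# a tip is a string, an internal node is a (left, right) tuple.

def index_internal_nodes(tree):
    """List internal nodes in in-order (left subtree, node, right subtree).

    Each entry is the sorted tuple of tip names below that node, so the
    position in the returned list serves as the node's index.
    """
    order = []

    def visit(node):
        if isinstance(node, str):          # tip: nothing to index
            return [node]
        left, right = node
        left_tips = visit(left)            # 1. visit the left subtree
        slot = len(order)
        order.append(None)                 # 2. reserve this node's in-order slot
        right_tips = visit(right)          # 3. visit the right subtree
        tips = left_tips + right_tips
        order[slot] = tuple(sorted(tips))
        return tips

    visit(tree)
    return order


def pad_tip_states(tip_states, max_tips, pad_value=0):
    """Right-pad a tip-state vector so every tree has the same input length."""
    if len(tip_states) > max_tips:
        raise ValueError("tree has more tips than the maximum supported size")
    return list(tip_states) + [pad_value] * (max_tips - len(tip_states))


tree = (("A", "B"), ("C", ("D", "E")))     # 5 tips, so 4 internal nodes
for i, clade in enumerate(index_internal_nodes(tree)):
    print(i, clade)
print(pad_tip_states([1, 0, 1, 1, 0], max_tips=8))
```

A binary tree with $N$ tips yields exactly $N-1$ entries, which is the number of internal-node targets the network must produce for that tree.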

B. Estimation Strategies

The authors evaluated three distinct strategies for training neural networks:

Marginal Estimation: The network estimates the probability distribution for each of the $N-1$ internal nodes independently. Each node is treated as a separate categorical variable.
Joint Estimation: The network estimates the probability of all possible combinations of internal node states simultaneously as a single variable with $S^{(N-1)}$ states (where $S$ is the number of character states). This is computationally expensive and scales poorly with tree size.
Single Node Estimation: The network is trained to estimate the state of a specific node (identified by name). To estimate all nodes, the process is repeated for each node individually.
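To make the scaling argument concrete, the sketch below counts how many output classes each strategy needs for a binary tree with $N$ tips and $S$ states. The function name `output_sizes` is hypothetical, and the counts follow only from the definitions above, not from PHYDDLE's actual network layout.

```python
def output_sizes(n_tips, n_states):
    """Number of network outputs implied by each estimation strategy."""
    n_internal = n_tips - 1                    # internal nodes of a binary tree
    return {
        # N-1 separate categorical variables with S classes each
        "marginal": n_internal * n_states,
        # one categorical variable over every combination of node states
        "joint": n_states ** n_internal,
        # one categorical variable with S classes per trained target node
        "single_node": n_states,
    }


for n_tips in (4, 10, 50):
    print(n_tips, output_sizes(n_tips, n_states=2))
```

The marginal output grows linearly with tree size while the joint output grows exponentially, which is why joint estimation becomes impractical beyond very small trees.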

C. Handling Complex State Transitions (Triplet Strategy)

For models where states can change at speciation events (e.g., GeoSSE), the authors introduced a Triplet Strategy. Instead of estimating a single state per node, the network estimates a triplet: (Parent State $\to$ Left Daughter State, Right Daughter State). This is encoded as a single categorical variable with $S^3$ states.
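One simple way to map a triplet onto a single categorical label is to read it as a three-digit number in base $S$. The sketch below assumes that ordering purely for illustration; the source states only that the variable has $S^3$ states, and the names `encode_triplet` and `decode_triplet` are hypothetical.

```python
def encode_triplet(parent, left, right, n_states):
    """Map (parent, left daughter, right daughter) to one label in [0, S^3)."""
    for state in (parent, left, right):
        if not 0 <= state < n_states:
            raise ValueError("state index out of range")
    # Assumed ordering: parent is the most significant base-S digit.
    return (parent * n_states + left) * n_states + right


def decode_triplet(label, n_states):
    """Invert encode_triplet, recovering the three state indices."""
    label, right = divmod(label, n_states)
    parent, left = divmod(label, n_states)
    return parent, left, right


label = encode_triplet(2, 0, 1, n_states=3)
print(label, decode_triplet(label, n_states=3))
```

Because the mapping is a bijection, the network can be trained as an ordinary $S^3$-class classifier and its predictions decoded back into parent and daughter states.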

D. Training and Simulation

Loss Function: Cross-entropy loss was used for classification.
Simulation: Training data was generated using simulators like CASTOR and DIVERSITREE for various models:
- Binary Markov models.
- State-dependent Speciation and Extinction (BiSSE, GeoSSE).
- SIR models with migration (SIRM) for epidemiology (no known likelihood).
Comparison: Performance was benchmarked against Bayesian inference (using RevBayes) where tractable, treating Bayesian results as the "gold standard" for accuracy.
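As a reminder of what the cross-entropy loss measures in this setting, the sketch below computes it for a set of internal nodes from raw network scores. It is a plain-Python illustration under stated assumptions: averaging the per-node losses is an assumption rather than a detail confirmed by the source, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution over states."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_state):
    """Negative log-probability assigned to the simulated (true) state."""
    return -math.log(softmax(logits)[true_state])


def marginal_loss(node_logits, true_states):
    """Mean cross-entropy over internal nodes (averaging is an assumption)."""
    losses = [cross_entropy(l, t) for l, t in zip(node_logits, true_states)]
    return sum(losses) / len(losses)


# Made-up scores for three internal nodes of a two-state character.
node_logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]
true_states = [0, 1, 1]
print(marginal_loss(node_logits, true_states))
```

The loss is small when the network places high probability on the state that was actually simulated at each node, and it grows sharply for confident wrong answers.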

3. Key Contributions

Implementation of ASR in PHYDDLE: Successfully adapted a likelihood-free deep learning pipeline to reconstruct ancestral states for internal nodes, not just root nodes.
Evaluation of Strategies: Systematically compared Marginal, Joint, and Single Node estimation strategies, identifying Marginal Estimation as the most robust balance of accuracy and scalability.
Handling Intractable Models: Demonstrated the application of deep learning to models like GeoSSE and SIRM, which are difficult or impossible to analyze with standard likelihood-based ASR.
Empirical Validation: Applied the method to two real-world datasets:
- Biogeography: Ancestral ranges of Liolaemus lizards (Andean vs. Lowland).
- Epidemiology: Ancestral locations of the 2014 Ebola virus outbreak in Sierra Leone.

4. Results

Performance on Simulated Data

Small Trees (4–50 tips): For simple Markov models, PHYDDLE's performance (both point estimates and probability distributions) closely matched Bayesian inference. The correlation between PHYDDLE and Bayesian probabilities was very high ($r > 0.95$).
Tree Size Effect: As tree size increased (100–200 tips), the accuracy of PHYDDLE relative to Bayesian inference declined.
- Deep nodes were inferred less accurately than shallow nodes.
- The gap in accuracy widened with larger trees, likely due to increased topological complexity and the difficulty of generalizing across unseen topologies.
Model Complexity:
- BiSSE: PHYDDLE performed well for shallow nodes but showed higher variance and occasional "confidently wrong" inferences for deep nodes compared to Bayesian methods.
- GeoSSE: PHYDDLE showed a bias toward inferring single-region states (the most common state in training data) when the true state was widespread. However, overall accuracy was comparable to Bayesian inference (approx. 64% vs 65% correct).
Variable Tree Sizes: Networks trained on variable tree sizes performed similarly to those trained on fixed sizes, suggesting the method can generalize across tree sizes if the training distribution is broad enough.
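The correlation reported for small trees compares, node by node, the probability each method assigns to a given state. A minimal sketch of that comparison is shown below; the probability values are made up for illustration and are not results from the paper, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Illustrative only: probability of state 1 at five internal nodes.
network_probs = [0.91, 0.12, 0.55, 0.80, 0.05]
bayesian_probs = [0.95, 0.08, 0.60, 0.74, 0.03]
print(pearson_r(network_probs, bayesian_probs))
```

A value near 1 means the two methods rank and scale node-level uncertainty in nearly the same way, which is a stronger agreement than merely picking the same most-likely state.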

Empirical Applications

Liolaemus Lizards: PHYDDLE and Bayesian inference produced highly concordant results for most nodes. Discrepancies occurred mainly in deep nodes with high tip-state variation. PHYDDLE tended to infer higher probabilities for Andean ranges in deep nodes compared to Bayesian methods.
Ebola Virus (SIRM Model):
- The model successfully inferred ancestral locations consistent with epidemiological data (e.g., deep nodes in Region 0, the likely origin).
- Some nodes were inferred to be in regions not represented among their descendants, which may reflect limitations of the training data or the complexity of the SIRM dynamics.
- Different independent training runs yielded slightly varying results, suggesting the need for ensemble averaging.
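Ensemble averaging, as suggested by the run-to-run variation above, can be as simple as taking the mean of the per-node state probabilities across independently trained networks. The sketch below assumes that unweighted scheme; the source does not specify how an ensemble would be built, and `ensemble_average` and the example values are hypothetical.

```python
def ensemble_average(run_probs):
    """Average state probabilities across runs.

    run_probs[r][n][s] is the probability that run r assigns to state s
    at internal node n. All runs must cover the same nodes and states.
    """
    n_runs = len(run_probs)
    n_nodes = len(run_probs[0])
    averaged = []
    for node in range(n_nodes):
        n_states = len(run_probs[0][node])
        averaged.append([
            sum(run[node][state] for run in run_probs) / n_runs
            for state in range(n_states)
        ])
    return averaged


# Made-up probabilities from three training runs, two nodes, two states.
runs = [
    [[0.8, 0.2], [0.4, 0.6]],
    [[0.6, 0.4], [0.5, 0.5]],
    [[0.7, 0.3], [0.3, 0.7]],
]
print(ensemble_average(runs))
```

Because each run's output is a probability distribution, the unweighted mean is also a valid distribution, so the averaged result can be read the same way as a single network's estimate.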

5. Significance and Discussion

Bridging the Likelihood Gap: The study shows that deep learning is a viable alternative for ASR in models where likelihood functions are intractable (e.g., complex epidemiological or macroevolutionary models).
Trade-offs:
- Method Error vs. Model Error: Likelihood-based methods have low method error but high model error if the model is too simple. Deep learning has higher method error (approximation error) but allows for the use of highly realistic models (low model error).
- Data Requirements: Deep learning requires massive, diverse training datasets. If the training data does not represent the empirical system (e.g., specific parameter ranges or topologies), performance degrades.
Future Directions:
- Architecture: Moving from standard CNNs to Graph Neural Networks (GNNs) could better capture the explicit tree structure and improve correlations between adjacent nodes.
- Bias Correction: Techniques like re-weighting loss functions could mitigate biases toward common states (e.g., in GeoSSE).
- Hybrid Approaches: Using deep learning to estimate parameters first, then feeding them into a likelihood-based ASR, or using summary statistics as additional inputs.

Conclusion: While not yet a perfect replacement for Bayesian inference on simple models, PHYDDLE provides a powerful, flexible framework for ancestral state reconstruction in complex, biologically realistic scenarios where traditional statistical methods fail.