LAML-Pro: Maximum Likelihood Inference of Cell Genotypes and Cell Lineage Trees

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to reconstruct the family history of a massive, sprawling family reunion that took place inside a petri dish. You have thousands of relatives (cells), and you want to draw a family tree showing who is related to whom and how they are connected.

In the past, scientists had a two-step process to do this, but it was like trying to solve a mystery with a blurry photograph:

Step 1 (The Blurry Photo): They looked at the cells and tried to guess their "genetic ID cards" (genotypes). Because the technology (like taking pictures of glowing cells) isn't perfect, they often made mistakes. Some IDs were blurry, some were missing, and some were just wrong.
Step 2 (The Wrong Tree): They took these flawed ID cards and tried to build a family tree. But if you feed a computer bad data, it builds a bad family tree. It might think two cousins are actually twins, or that a stranger is part of the family.

Enter LAML-Pro: The "Super-Detective" Algorithm

The paper introduces a new tool called LAML-Pro. Instead of doing the two steps separately, LAML-Pro does them at the same time. It acts like a super-detective who doesn't just look at the blurry photo and guess the ID; it looks at the photo and the family tree together, adjusting both until they agree as well as the evidence allows.

Here is how it works, using some everyday analogies:

1. The "Broken Puzzle" Analogy

Imagine you have a giant jigsaw puzzle (the family tree), but many pieces are missing, and the ones you have are smudged or upside down.

Old Method: You try to clean the smudges first (guess the genotypes). If you clean them wrong, the puzzle pieces won't fit together later, and you end up with a broken picture.
LAML-Pro: It holds the puzzle pieces and the picture of the final image in its mind simultaneously. If a piece looks like it should be a blue sky but the smudge makes it look like a green tree, LAML-Pro says, "Wait, if I put this piece here, the whole picture makes more sense. Let's assume the smudge was just a trick of the light." It fixes the smudge while building the picture.

2. The "Noisy Classroom" Analogy

Imagine a teacher trying to figure out which students are sitting next to each other based on a noisy recording of their voices.

The Problem: The recording is full of static (noise). Sometimes a student's voice is too quiet to hear (missing data), and sometimes the static makes a "hello" sound like "hullo."
The Old Way: The teacher writes down what they think they heard, then tries to arrange the students. If they misheard a word, they put the wrong students together.
LAML-Pro: The teacher listens to the whole room at once. They realize, "If Student A is sitting next to Student B, they would likely say similar things. Even though the recording of Student A is fuzzy, the fact that Student B is clear helps me guess what Student A actually said." By using the context of the whole group, it cleans up the noise.

3. The "Magic Eraser" for Mistakes

One of the biggest problems with imaging cells (taking pictures of them) is that the data is often "uncertain." It's like looking at a fingerprint in the rain; you see a shape, but you aren't 100% sure.

Old Methods: They would throw away the fuzzy fingerprints or settle for a best guess. With imaging data, this left 25-50% of the IDs uncertain or wrong.
LAML-Pro: It uses a special mathematical model (called PMMO) that understands why the data is fuzzy. It knows that sometimes a cell just "forgot" to show its ID (a dropout) or that the camera was too dim. Instead of giving up, it uses the surrounding clues to fill in the blanks.
- The Result: On real imaging data, it reduced genotype errors from a messy 25-50% down to about 0.3%, which is comparable to the error rate of sequencing.

Why Does This Matter?

In biology, knowing the family tree of cells is crucial for understanding how diseases like cancer grow or how a baby develops from a single cell.

Before: Scientists were building family trees on shaky ground, leading to wrong conclusions about how cells divide and move.
Now: With LAML-Pro, they can build a solid, accurate tree even when the data is messy. It's like upgrading from a sketch drawn in pencil to a high-definition 3D map.

In a Nutshell:
LAML-Pro is a smart computer program that stops trying to "clean the data" before "building the tree." Instead, it cleans the data while building the tree, using the logic of the whole family to fix the mistakes of the individual members. This allows scientists to see the true history of cell life, even when the evidence is fuzzy, missing, or confusing.

1. Problem Statement

Context: Dynamic Lineage Tracing (DLT) technologies use genome editing (e.g., CRISPR/Cas9, base editors) to induce heritable mutations in cells. These mutations accumulate over cell divisions, creating a record of cell history that can be measured via single-cell sequencing or fluorescence imaging to reconstruct cell lineage trees (phylogenies).

The Challenge: Current computational pipelines for reconstructing lineage trees operate in two distinct, sequential steps:

Genotyping: Inferring the discrete genotype (edit state) of each cell from raw data (sequencing reads or fluorescence pixel intensities).
Tree Inference: Reconstructing the lineage tree based on the inferred genotypes.

Limitations:

Error Propagation: Genotyping is an inexact process. Imaging-based methods, in particular, suffer from high rates of uncertainty and error (25–50% uncertain genotypes) due to noise in fluorescence signals and dropout events.
Information Loss: Standard pipelines often discard low-confidence genotype calls or missing data, leading to incomplete datasets.
Suboptimal Trees: Errors in the initial genotyping step propagate into the tree inference, resulting in inaccurate topologies and branch lengths. Existing methods assume genotypes are known inputs, failing to account for the uncertainty inherent in the observation process.
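
To make the information-loss point concrete, here is a minimal Python sketch of a threshold-based genotype call of the kind a two-step pipeline might use. The function name `hard_call`, the 0.9 threshold, and the example probabilities are hypothetical and are not taken from any specific pipeline.

```python
def hard_call(state_probs, min_confidence=0.9):
    """Keep the most probable state only if it clears the confidence threshold.

    Returns the index of the called state, or None when the call is discarded.
    The threshold value is an illustrative assumption.
    """
    best = max(range(len(state_probs)), key=state_probs.__getitem__)
    return best if state_probs[best] >= min_confidence else None


confident = [0.95, 0.03, 0.02]   # clear signal for state 0
ambiguous = [0.55, 0.40, 0.05]   # state 0 or 1, but almost surely not state 2

print(hard_call(confident))   # 0
print(hard_call(ambiguous))   # None
```

The ambiguous site is dropped entirely, even though its probabilities still carry evidence (state 2 is very unlikely). A joint method can keep the whole probability vector and let the tree resolve the ambiguity.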

2. Methodology: LAML-Pro

The authors introduce LAML-Pro (Lineage Analysis via Maximum Likelihood with PRobabilistic Observations), an algorithm that jointly infers the cell lineage tree and the cell genotypes directly from raw observations, bypassing the intermediate step of discrete genotype calling.

A. The PMMO Model

LAML-Pro is built upon the Probabilistic Mixed-type Missing with Observations (PMMO) model, which integrates the genome editing process with a generative model of the observed data.

Hidden States ($Z$): Represent the true genotype at $K$ genomic sites. The alphabet includes the unedited state (0), edited states ($1 \dots M$), and a heritable missing state (-1), representing epigenetic silencing or resection.
Editing Process: Modeled as a Continuous-Time Markov Chain (CTMC) with a transition rate matrix $Q$. It enforces non-modifiability (once edited, a site cannot be re-edited) and irreversibility (an edited site cannot return to the unedited state).
Observation Process ($X$):
- Missing Data: Modeled via a "dropout" probability ($\vartheta$) where an observation is missing ('?') even if the site is not silenced.
- Imaging/Sequencing: For imaging, the model uses kernel density estimators to describe how likely the observed continuous pixel intensities are under each discrete hidden state. For sequencing, it models read counts.
- Key Feature: The model explicitly handles the relationship between the unknown true genotype $Z$ and the noisy observation $X$, allowing for the marginalization of $Z$ during tree inference.
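
As a rough illustration of the two layers described above, the following Python sketch builds a rate matrix for one site, approximates the transition probabilities along a branch, and computes observation likelihoods with dropout. The state ordering, the rate values, the discrete emission table (standing in for the kernel density estimate), the assumption that edited sites can still be silenced, and all function names are choices made for this sketch, not the paper's parameterization.

```python
import numpy as np

def build_rate_matrix(edit_rates, silencing_rate):
    """Rate matrix Q for one site.

    Assumed state order: 0 = unedited, 1..M = edited, M+1 = silenced.
    """
    M = len(edit_rates)
    Q = np.zeros((M + 2, M + 2))
    Q[0, 1:M + 1] = edit_rates          # unedited -> any edited state
    Q[:M + 1, M + 1] = silencing_rate   # any readable state -> silenced
    # Edited rows get no other entries: no re-editing and no reversion.
    # The silenced row stays all zero, so that state is absorbing.
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

def transition_probs(Q, t, squarings=20):
    """Approximate exp(Q * t) by repeatedly squaring one tiny Euler step."""
    P = np.eye(Q.shape[0]) + Q * (t / 2 ** squarings)
    for _ in range(squarings):
        P = P @ P
    return P

def observation_likelihood(x, dropout, emission):
    """Return P(X = x | Z = z) for every hidden state z (silenced state last).

    x is None for a missing observation ('?'); otherwise it indexes a column
    of emission, where emission[z, x] = P(readout x | state z, no dropout).
    """
    lik = np.zeros(emission.shape[0] + 1)
    if x is None:
        lik[:-1] = dropout              # dropout hides a readable site
        lik[-1] = 1.0                   # a silenced site is always missing
    else:
        lik[:-1] = (1.0 - dropout) * emission[:, x]
    return lik

Q = build_rate_matrix(edit_rates=[0.6, 0.4], silencing_rate=0.1)
P = transition_probs(Q, t=0.5)          # P[i, j] = P(child = j | parent = i)
```

The hand-rolled exponential keeps the sketch dependent on NumPy only; a library routine such as `scipy.linalg.expm` would normally be used instead. Because `Q` has no entries that leave an edited state for another edited or unedited state, the corresponding entries of `P` stay at zero for any branch length.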

B. Optimization Algorithm

LAML-Pro solves a maximum likelihood optimization problem to find the tree topology $T$, branch lengths $\omega$, and model parameters $\Theta$ (editing rates, dropout rates, etc.).

Objective: Maximize the log-likelihood $\log L(T, \Theta; x)$ by marginalizing over all possible assignments of hidden genotypes $Z$.
Heuristic Search: Uses a Simulated Annealing approach with Nearest Neighbor Interchange (NNI) moves to explore tree topologies.
Parameter Estimation: Employs an Expectation-Maximization (EM) algorithm:
- E-step: Computes posterior probabilities of hidden states and transitions using Felsenstein's pruning algorithm. Complexity is optimized to $O(NM)$ using matrix sparsity.
- M-step: Jointly updates all parameters (branch lengths, rates, dropout probabilities) using an interior-point method (IPOPT). This ensures fast quadratic convergence to stationary points, unlike the block coordinate ascent used in previous methods.
Constraints:
- Ultrametric Constraint: Enforces a strict molecular clock (all root-to-leaf distances are equal), appropriate for experiments where cells are sampled simultaneously.
- Minimum Branch Length: Prevents zero-length branches (biologically implausible instantaneous divisions) to improve numerical stability.
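
The marginalization over hidden genotypes can be illustrated with Felsenstein's pruning algorithm, where each leaf contributes a vector of observation likelihoods rather than a single called genotype. The following is a minimal sketch for one site on a fixed tree. The nested-tuple tree encoding, the function names, the two-state alphabet, and the toy numbers are assumptions for illustration; the topology search, the EM updates, and the sparsity optimizations are omitted.

```python
import numpy as np

def pruning(node, leaf_lik, trans):
    """Return the vector P(observations below node | node is in state z).

    node: a leaf name (str) or a tuple of (child, branch_id) pairs.
    leaf_lik[name]: P(x_leaf | z) for each hidden state z of that leaf.
    trans[branch_id][i, j]: P(child state j | parent state i) on that branch.
    """
    if isinstance(node, str):
        return leaf_lik[node]
    partial = 1.0
    for child, branch_id in node:
        # Sum over the child's hidden state, weighted by the branch transition.
        partial = partial * (trans[branch_id] @ pruning(child, leaf_lik, trans))
    return partial

def site_log_likelihood(tree, leaf_lik, trans, root_prior):
    """Log-likelihood of one site with all hidden states summed out."""
    return float(np.log(root_prior @ pruning(tree, leaf_lik, trans)))

def irreversible(p):
    """Two-state transition matrix: unedited -> edited with probability p."""
    return np.array([[1.0 - p, p], [0.0, 1.0]])

# Toy tree ((a, b), c) with one transition matrix per branch.
trans = {"br_ab": irreversible(0.3), "br_a": irreversible(0.2),
         "br_b": irreversible(0.2), "br_c": irreversible(0.5)}
cherry = (("a", "br_a"), ("b", "br_b"))
tree = ((cherry, "br_ab"), ("c", "br_c"))

# Soft evidence at the leaves: a and b look edited, c looks unedited.
leaf_lik = {"a": np.array([0.1, 0.9]),
            "b": np.array([0.4, 0.6]),
            "c": np.array([0.95, 0.05])}
root_prior = np.array([1.0, 0.0])   # the root starts unedited

log_lik = site_log_likelihood(tree, leaf_lik, trans, root_prior)
```

Leaf `b` is ambiguous on its own, but it shares an ancestor with the clearly edited leaf `a`, so the sum over hidden states weights both readings of `b` by how well they fit the tree instead of committing to one call up front.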

3. Key Contributions

Joint Inference Framework: First method to simultaneously infer lineage trees and genotypes from raw DLT data (imaging or sequencing), eliminating the error-prone intermediate genotyping step.
PMMO Model: A novel probabilistic model that unifies genome editing dynamics with complex observation noise (dropout, silencing, and measurement error).
Scalability: Despite the complexity of marginalizing over genotypes, LAML-Pro scales efficiently to thousands of cells (e.g., 3,108 cells in <18 hours) by leveraging matrix sparsity and efficient optimization.
Error Correction: Demonstrates the ability to "correct" genotype errors and impute missing data that would otherwise be discarded by standard pipelines.

4. Results

The authors evaluated LAML-Pro on simulated data and two real-world imaging-based datasets (PEtracer and baseMEMOIR).

A. Simulated Data

Accuracy: LAML-Pro significantly outperformed existing methods (LAML, Neighbor Joining, ConvexML) in reconstructing tree topologies, achieving a median normalized Robinson-Foulds (RF) distance of 0.03 (vs. 0.12–0.18 for others) under realistic missing data conditions.
Genotype Correction: It correctly inferred genotypes at 90% of sites, compared to 77% for baseline methods that pick the most likely genotype independently.
Branch Lengths: Achieved near-perfect correlation ($R^2 = 0.995$) with true branch lengths, significantly outperforming the other methods, whose accuracy degrades as observation noise increases.
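
For readers unfamiliar with the topology metric, the sketch below computes a normalized Robinson-Foulds distance between two rooted trees written as nested tuples of leaf names. It compares clades (the leaf sets under internal nodes) and divides the number of unshared clades by the total number of clades. The rooted, clade-based variant and the tuple encoding are assumptions of this sketch; the paper may use an unrooted or differently normalized definition.

```python
def clades(tree):
    """Collect the leaf set under every internal node except the root."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    found.discard(leaves(tree))   # the root clade is shared by every tree
    return found

def normalized_rf(tree_a, tree_b):
    """Fraction of clades that appear in only one of the two trees."""
    a, b = clades(tree_a), clades(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


true_tree = ((("A", "B"), "C"), "D")
estimate = ((("A", "B"), "D"), "C")

print(normalized_rf(true_tree, true_tree))   # 0.0
print(normalized_rf(true_tree, estimate))    # 0.5
```

A value of 0 means the two trees contain exactly the same groupings, and a value of 1 means they share none, so a median of 0.03 corresponds to trees that differ in only a small fraction of their groupings.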

B. Real-World Imaging Data (PEtracer)

Genotype Accuracy: Applied to 4T1 breast cancer cells, LAML-Pro reduced genotype error rates from imaging-based readouts (typically 25–50%) down to 0.3%, comparable to sequencing error rates.
Data Utilization: It eliminated missing genotypes by inferring dropout events and imputing silenced sites, whereas the standard PEtracer pipeline discarded ~22% of data.
Spatial Concordance: The LAML-Pro tree showed significantly higher correlation between phylogenetic distance and spatial cell coordinates ($R=0.39$) compared to the published PEtracer tree ($R=0.07$). Because closely related cells tend to stay near each other, this is consistent with a more biologically accurate reconstruction of cell migration and division history.
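
One way to compute a concordance score of this kind is to correlate pairwise tree distances with pairwise physical distances between cells. The sketch below assumes a precomputed matrix of phylogenetic distances, Euclidean spatial distance, and Pearson correlation; the function name and the toy data are hypothetical, and the paper's exact distance and correlation choices may differ.

```python
import numpy as np

def phylo_spatial_correlation(phylo_dist, coords):
    """Pearson correlation between tree distance and spatial distance.

    phylo_dist: (n, n) symmetric matrix of pairwise distances on the tree.
    coords: (n, d) array of cell positions.
    """
    phylo_dist = np.asarray(phylo_dist, dtype=float)
    coords = np.asarray(coords, dtype=float)
    spatial = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.triu_indices(len(coords), k=1)   # count each pair of cells once
    return float(np.corrcoef(phylo_dist[i, j], spatial[i, j])[0, 1])


# Toy data: two sibling pairs, each pair sitting close together in space.
coords = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
phylo_dist = [[0, 2, 6, 6],
              [2, 0, 6, 6],
              [6, 6, 0, 2],
              [6, 6, 2, 0]]

r = phylo_spatial_correlation(phylo_dist, coords)
```

In this toy case the correlation is close to 1 because siblings are also spatial neighbors. Real tissue is far noisier, which is why values such as 0.39 can still represent a clear improvement over 0.07.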

C. Real-World Imaging Data (baseMEMOIR)

Handling Low Confidence: LAML-Pro utilized observations that the baseMEMOIR pipeline discarded due to low confidence.
Tree Quality: The LAML-Pro trees showed much higher genotype concordance with the underlying data (lower Expected Hamming Distance between siblings) and better spatial correlation ($R=0.349$) than the baseMEMOIR trees.
Migration History: Under a Brownian motion migration model, LAML-Pro trees supported a higher-likelihood ancestral migration history than the original trees.

5. Significance

Paradigm Shift: LAML-Pro moves the field from a "genotype-then-tree" pipeline to a unified "observation-to-tree" framework, acknowledging that genotypes are latent variables.
Feasibility of Imaging: By reducing genotype error rates to sequencing levels, LAML-Pro makes fluorescence imaging a viable, high-accuracy alternative to sequencing for lineage tracing, offering higher throughput and spatial resolution.
Robustness: The method is robust to high rates of missing data and noise, enabling the reconstruction of lineage trees from datasets that were previously considered too noisy or incomplete.
Open Source: The tool is freely available, facilitating broader adoption in developmental biology and cancer research.

In summary, LAML-Pro represents a major advancement in computational lineage tracing by mathematically integrating observation noise into the tree inference process, resulting in more accurate, complete, and biologically coherent cell lineage trees.