ARGformer: learning on ancestral recombination graphs with transformers

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine your DNA isn't just a long, static string of letters (A, C, T, G). Instead, think of it as a massive, tangled family tree that has been growing, splitting, and recombining for thousands of years. Every time your ancestors had children, their family trees merged. Every time they moved to a new continent or mixed with a different group, new branches were added.

Scientists call this tangled history an Ancestral Recombination Graph (ARG). It's the ultimate "who is related to whom and when" map for the entire human species.

The problem? This map is so huge and complex that it's impossible for humans to read. It's like trying to understand the history of the entire internet by looking at a single, unorganized pile of every email ever sent.

Enter ARGformer.

What is ARGformer?

Think of ARGformer as a super-smart librarian who has read every single page of that massive family history book. But instead of reading the text word-for-word, it learns the structure of the stories.

It uses a type of AI called a Transformer (the same technology behind tools like ChatGPT) to look at these family trees and turn them into digital fingerprints (called "embeddings").

Here is how it works, broken down with simple analogies:

1. The "Masked" Game (Learning the Rules)

Imagine you are playing a game where you have to guess a missing word in a sentence.

The Input: The AI looks at a path from a person's DNA back to their ancient ancestors.
The Trick: It hides (masks) a few steps in that family tree and asks, "Based on the rest of the path, what ancestor should be here?"
The Result: By playing this guessing game millions of times, the AI learns the "grammar" of human history. It learns that if you see a certain pattern of ancestors, it usually means you come from Africa, or Europe, or that you have a bit of Neanderthal in you. It does this without ever looking at the raw DNA letters, just the shape of the family tree.

2. The "Group Hug" (Fine-Tuning)

Once the AI has learned the grammar, we teach it to be a better sorter.

We show it examples of people from specific groups (like "African," "East Asian," or "Oceanian").
We tell the AI: "Make the digital fingerprints of people in the same group look very similar (like a group hug), and make the fingerprints of different groups look very different."
Now, the AI can take a new, unknown DNA sample, turn it into a fingerprint, and instantly see which "group hug" it belongs to.

What Did They Discover?

The researchers tested this AI on two big challenges:

1. Finding the "Ghost" Ancestors (Denisovans)
In the Pacific Islands (Oceania), people carry DNA from the Denisovans, an ancient human group (cousins of the Neanderthals) that no longer exists today.

The Test: The AI looked at the family trees of Oceanian people.
The Result: Even without being told "look for Denisovans," the AI found specific segments of the family tree that looked suspiciously like the ancient Denisovan branches. It successfully highlighted the parts of the genome where Oceanians "hugged" their ancient cousins.

2. The Mystery of South American Ancestry
For a long time, scientists noticed that some Indigenous groups in the Amazon (like the Suruí and Karitiana) seemed to have a tiny, mysterious connection to people from Oceania (Australia/New Guinea), even though they are thousands of miles apart.

The Test: The AI looked at the family trees of South American tribes.
The Result: The findings were consistent with that suspicion. While most of their family trees looked like typical South American/East Asian trees, the AI found specific "branches" in the Suruí and Karitiana trees that were surprisingly close to Oceanian trees. It's like finding a few pages in a South American history book that were written in an Oceanian dialect.

Why Does This Matter?

Before ARGformer, scientists studying population history typically worked from the raw DNA letters (genotypes) and used heavy math to guess the history. It was like trying to understand a movie by looking at a spreadsheet of every pixel in every frame.

ARGformer changes the game.

It skips the pixel spreadsheet and looks directly at the story structure (the family tree).
It compresses hundreds of thousands of years of human history into a simple, easy-to-read map.
It allows us to find hidden connections (like the Oceanian link in South America) that were previously too subtle to see.

The Bottom Line

ARGformer is a tool that turns the chaotic, tangled mess of human family history into a clear, organized map. It suggests that by understanding how our ancestors are related (the graph), rather than just what their DNA says (the letters), we can uncover secrets about our past that were previously hard to see. It's like finally having a GPS for the human family tree.

1. Problem Statement

Population genetics has traditionally relied on dimensionality reduction techniques (e.g., PCA, UMAP) applied directly to genotype matrices (SNP data) to visualize population structure. While recent advances allow for the inference of Ancestral Recombination Graphs (ARGs) at a genome-wide scale for large cohorts, there is a significant gap in methodology for effectively summarizing and utilizing this complex topological data.

The Challenge: ARGs encode the full history of recombination, mutation, and shared lineages, but they are ultra-large, high-dimensional graphs. Existing deep learning models (like VAEs) operate on genotype matrices, ignoring the underlying genealogical history. There is no standard, scalable, self-supervised framework for representation learning directly on genome-wide genealogies or ARG-derived structures.
The Goal: Develop a method to learn dense, context-dependent embeddings directly from ARGs that capture global population structure and local ancestry without requiring access to raw genotype matrices.

2. Methodology

The authors propose ARGformer, an encoder-only transformer architecture designed specifically to process ARGs.

A. Data Representation (Tokenization)

Instead of attempting to encode the entire ARG or full marginal trees (which are computationally prohibitive), ARGformer encodes leaf-to-root paths within the marginal coalescent trees.

Path Sequences: For each extant haplotype, the path from the leaf (sample) to the root (most recent common ancestor) is treated as a sequence of tokens, where each token represents a unique node identifier in the genealogy.
Context Preservation: Since different paths share internal ancestral nodes and adjacent trees along the genome share topological structures, this path-based representation retains substantial genealogical context while remaining scalable.
Positional Encodings: The model incorporates positional encodings that reflect the ordering of coalescence events and local tree topology.
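
The path-based tokenization described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the parent-dictionary tree format, the special tokens `[PAD]` and `[MASK]`, and the helper names `leaf_to_root_path`, `build_vocab`, and `encode` are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical special tokens; the summary does not list the real vocabulary.
PAD, MASK = "[PAD]", "[MASK]"


def leaf_to_root_path(parent: Dict[int, Optional[int]], leaf: int) -> List[int]:
    """Walk from a sample (leaf) up to the root of one marginal tree."""
    path = [leaf]
    node = leaf
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def build_vocab(paths: List[List[int]]) -> Dict[object, int]:
    """Assign an integer token ID to each special token and each node ID seen."""
    vocab: Dict[object, int] = {PAD: 0, MASK: 1}
    for path in paths:
        for node in path:
            vocab.setdefault(node, len(vocab))
    return vocab


def encode(path: List[int], vocab: Dict[object, int], max_len: int) -> List[int]:
    """Convert a path to token IDs, truncating at the root end and right-padding."""
    ids = [vocab[node] for node in path][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Toy marginal tree: samples 0-3, internal nodes 4 and 5, root 6.
#        6
#      /   \
#     4     5
#    / \   / \
#   0   1 2   3
toy_parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
paths = [leaf_to_root_path(toy_parent, leaf) for leaf in range(4)]
vocab = build_vocab(paths)
encoded = [encode(p, vocab, max_len=4) for p in paths]
```

In this toy tree, samples 0 and 1 share ancestor 4, so their encoded paths share a token at the second position. That shared-token overlap is the kind of genealogical context the path representation is said to retain.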

B. Model Architecture

ARGformer is built upon a ModernBERT-style encoder, incorporating recent architectural improvements for efficiency.

Self-Supervised Pretraining: The model uses a masked-node objective (analogous to Masked Language Modeling in NLP).
- Randomly masks node tokens along each path (probability $p_{\text{mask}} = 0.30$).
- Optimizes a cross-entropy loss to predict the masked nodes based on the surrounding context.
- This forces the model to learn the statistical dependencies of coalescent events and recombination history.
Contrastive Finetuning: After pretraining, the model is finetuned using a supervised contrastive loss (InfoNCE).
- Embeddings of sequences sharing the same population label are pulled together.
- Embeddings from different labels are pushed apart.
- This step sharpens the separation of distinct local-ancestry clusters for downstream tasks.
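
The two training stages above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The assumptions are: every selected token is replaced by `[MASK]` (the summary does not say whether other replacement rules are used), similarity is cosine, the temperature is 0.1, and the loss is averaged over positives per anchor. `MASK_ID`, `IGNORE`, and the function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

MASK_ID = 1     # hypothetical token ID for [MASK]
IGNORE = -100   # label meaning "no loss at this position" (a common convention)


def mask_path(token_ids: Sequence[int], p_mask: float = 0.30,
              rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    """Mask each token independently with probability p_mask.

    Returns the corrupted inputs and per-position labels; unmasked
    positions get IGNORE so they would not contribute to cross-entropy.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < p_mask:
            inputs.append(MASK_ID)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    # Assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def supervised_contrastive_loss(embeddings: Sequence[Sequence[float]],
                                labels: Sequence[str],
                                temperature: float = 0.1) -> float:
    """InfoNCE-style loss where same-label items act as positives."""
    total, n_anchors = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        logits = {j: _cosine(embeddings[i], embeddings[j]) / temperature for j in others}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Stage 1: corrupt one tokenized path.
inputs, mlm_labels = mask_path([2, 3, 4, 5, 6, 7], rng=random.Random(0))

# Stage 2: the loss is low when same-label embeddings sit together.
pop_labels = ["AFR", "AFR", "EUR", "EUR"]
tight = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
loss_tight = supervised_contrastive_loss(tight, pop_labels)
loss_mixed = supervised_contrastive_loss(mixed, pop_labels)
```

Here `loss_tight` comes out lower than `loss_mixed`, which is the "pull together, push apart" behavior the finetuning step relies on. The population labels and 2-D vectors are toy values chosen only to make that contrast visible.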

C. Training Data

The model was trained on two types of data:

Simulated Data: Coalescent simulations using msprime with an "Out-of-Africa" demography (including admixed populations).
Empirical Data: ARGs inferred from ancient and present-day Homo sapiens genomes (using tsinfer+tsdate), including archaic hominins (Neanderthals, Denisovans).

3. Key Contributions

First Self-Supervised Framework for ARGs: ARGformer is presented as the first model to apply transformer-based representation learning directly to inferred genome-wide genealogies, bypassing the need for genotype matrices.
Path-Based Encoding Strategy: It introduces a novel tokenization scheme that converts complex graph structures into manageable path sequences, leveraging shared ancestral nodes to capture evolutionary history.
Genotype-Free Analysis: The method demonstrates that population structure and ancestry can be inferred solely from learned embeddings of genealogical topology, decoupling analysis from raw sequence data.
Scalability: By focusing on paths rather than the full graph, the approach scales to large cohorts where full ARG processing is infeasible.

4. Results

A. Capturing Global Population Structure

Visualization: When applied to simulated admixed data, ARGformer embeddings (visualized via PCA) successfully separated individuals into continental clusters (African, European, East Asian) and correctly positioned admixed individuals between them.
Ablation Study: Self-supervised pretraining alone captured inherent population structure. Contrastive finetuning further sharpened the separation of local-ancestry clusters.
Genealogical Depth: The embeddings inherently encode "genealogical depth." A ridge probe trained on frozen embeddings could accurately predict the number of coalescent events along a path ($R^2 \approx 0.65$), confirming the model learns structural properties of the tree, not just class separability.
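
The probing setup described above can be written out as follows. This is a generic ridge-regression formulation rather than one taken from the preprint: $z_i$ denotes the frozen embedding of path $i$, $y_i$ the number of coalescent events along that path, $\hat{y}_i = \hat{w}^\top z_i + \hat{b}$ the probe's prediction, $\bar{y}$ the mean of the $y_i$, and $\lambda$ a regularization strength whose value the summary does not give.

$$
(\hat{w}, \hat{b}) = \arg\min_{w,\, b} \sum_{i=1}^{n} \left( y_i - w^\top z_i - b \right)^2 + \lambda \lVert w \rVert_2^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

Under this reading, $R^2 \approx 0.65$ means a linear readout of the embedding explains roughly 65% of the variance in path depth, with the encoder weights left untouched.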

B. Local Ancestry Inference (LAI)

Benchmarking: ARGformer was tested against FLARE, a state-of-the-art local ancestry inference tool.
Performance: Using only embeddings, ARGformer achieved precision and recall comparable to FLARE (e.g., ~98% precision for European ancestry).
Strategies: Two strategies were effective:
1. PCA Clustering: Projecting embeddings into a reference PC space.
2. Nearest-Neighbor Retrieval: Assigning ancestry based on the labels of the top-$k$ nearest neighbors in embedding space. The retrieval method performed slightly better.
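
The retrieval strategy can be sketched as a majority vote over the nearest reference embeddings. This is an illustrative sketch only: cosine similarity, the value of `k`, the unweighted vote, and the function names are assumptions, since the summary does not specify how neighbors are scored or combined.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    # Assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_ancestry(query: Sequence[float],
                 ref_embeddings: Sequence[Sequence[float]],
                 ref_labels: Sequence[str],
                 k: int = 3) -> Tuple[str, List[str]]:
    """Return the majority label among the k most similar references,
    together with the labels of those k neighbors (most similar first)."""
    scored = [(cosine_similarity(query, ref), label)
              for ref, label in zip(ref_embeddings, ref_labels)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_labels = [label for _, label in scored[:k]]
    # On a tie, Counter keeps the label encountered first, i.e. the closer one.
    winner = Counter(top_labels).most_common(1)[0][0]
    return winner, top_labels


# Toy reference panel: two labeled embeddings per population.
ref_embeddings = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]]
ref_labels = ["AFR", "AFR", "EUR", "EUR"]

# A query segment whose embedding lies close to the EUR references.
predicted, neighbors = knn_ancestry([0.2, 1.0], ref_embeddings, ref_labels, k=3)
```

In the toy panel the query's three nearest references are two EUR embeddings and one AFR embedding, so the vote returns EUR. Returning the neighbor labels alongside the prediction makes it possible to inspect mixed neighborhoods, which is what the introgression analyses in the next section depend on.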

C. Detection of Archaic Introgression

Denisovan Ancestry: In Oceanian genomes (specifically Papuan Highlands), ARGformer embeddings identified segments with nearest neighbors in Denisovan reference populations. The retrieval showed a significant enrichment (3.60% Denisovan neighbors vs. <1.5% in non-Oceanian controls), consistent with known Denisovan introgression.
South American Ancestry: The model detected subtle Oceanian-like ancestry in specific Indigenous South American populations (Suruí and Karitiana). These groups showed a higher retrieval rate of Oceanian neighbors (~9%) compared to other Indigenous American groups (~4-5%), echoing previous findings of "Australasian" signals in the Amazon.
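
The retrieval rates quoted above are fractions of neighbors carrying a given reference label. A minimal sketch of that bookkeeping is shown below, assuming the neighbor labels for each query segment have already been retrieved. The function names, group names, and labels are hypothetical, and the toy lists are made up for illustration; they are not data from the preprint.

```python
from typing import Dict, Sequence


def neighbor_label_rate(neighbor_labels: Sequence[Sequence[str]], target: str) -> float:
    """Fraction of all retrieved neighbors (pooled over queries) labeled `target`."""
    total = sum(len(labels) for labels in neighbor_labels)
    if total == 0:
        return 0.0
    hits = sum(label == target for labels in neighbor_labels for label in labels)
    return hits / total


def rates_by_group(neighbors_by_group: Dict[str, Sequence[Sequence[str]]],
                   target: str) -> Dict[str, float]:
    """Compute the target-label retrieval rate separately for each query group."""
    return {group: neighbor_label_rate(labels, target)
            for group, labels in neighbors_by_group.items()}


# Toy input: for each group, one list of neighbor labels per query segment.
toy_neighbors = {
    "group_A": [["OCE", "AMR", "AMR", "AMR"], ["AMR", "AMR", "EAS", "AMR"]],
    "group_B": [["AMR", "AMR", "AMR", "AMR"], ["AMR", "EAS", "AMR", "AMR"]],
}
toy_rates = rates_by_group(toy_neighbors, target="OCE")
```

Comparing such rates between a focal group and control groups is one simple way to express the enrichment described above. Whether a difference is statistically meaningful would need a separate test that this sketch does not attempt.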

5. Significance and Future Directions

Paradigm Shift: ARGformer shifts the focus from analyzing static genotype matrices to analyzing the dynamic genealogical history encoded in ARGs. It treats the ARG as the primary data source for learning evolutionary relationships.
Interpretability: The model provides a low-dimensional latent space where specific genomic segments can be queried to reveal localized demographic events (e.g., introgression, admixture) without manual parsing of the graph.
Future Applications: The framework is expected to extend to detecting demographic bottlenecks, signatures of selection, and scaling to biobank-sized cohorts. Future work may explore defining tokens based on shared lineages over hundreds of generations to further compress the data.
Limitations: The method relies on the quality of the inferred ARG (which can be noisy with high variant density) and currently focuses on individual-level ancestry rather than global graph-level tasks.

In summary, ARGformer successfully bridges the gap between deep learning and population genetics by creating a powerful, self-supervised representation of ancestral recombination graphs, enabling accurate ancestry inference and demographic discovery directly from genealogical topology.