This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
The Big Picture: Finding the "Fingerprints" of Evolution
Imagine the human genome as a massive, ancient library containing the history of our species. Every book in this library is a person's DNA. Over thousands of years, nature has been editing these books. Sometimes, a specific change (a mutation) helps a person survive better—like giving them a superpower to digest milk or fight off a specific disease. When this happens, that "superpower" spreads quickly through the population.
Scientists call this Natural Selection. The goal of this research is to find the "fingerprints" or "signatures" left behind in the DNA library when these superpowers spread.
The Problem: The Library is Messy
The problem is that the library is incredibly messy.
- Noise: Sometimes, the books look different just because of random chance (like a typo that happened by accident), not because of a superpower.
- Confusion: Different populations have different histories. A pattern that looks like a superpower in one group might just be a random accident in another.
- Old Methods: Traditional tools used to find these signatures are like using a magnifying glass to read a book in the dark. They often miss the good stuff or get confused by the noise.
Recently, scientists started using Deep Learning (AI) to solve this. They trained computers on simulated DNA (fake DNA created by computers) to teach them what a "superpower" looks like. But here's the catch: Simulations are like video game physics. They are simplified versions of reality. When you train an AI on a video game, it often fails when you put it in the real world because the real world is too complex and messy.
The Solution: Popformer (The "Genetic Translator")
The authors of this paper built a new AI model called Popformer. Think of it as a highly advanced translator that learned to speak "Genetic" by reading real books from the library, not just fake ones from a video game.
Here is how they built it, using a simple analogy:
1. The "Fill-in-the-Blank" Game (Pre-training)
Before teaching Popformer to find superpowers, the authors taught it to understand how DNA works in general.
- The Analogy: Imagine you have a sentence with 75% of the words covered up with black tape. Your job is to guess the missing words based on the context of the sentence.
- In the Paper: They took real human DNA data and hid (masked) random pieces of it. They forced the AI to guess what the missing DNA letters were.
- The Result: By playing this game millions of times, Popformer learned the "grammar" and "vocabulary" of human DNA. It learned how genes usually sit next to each other, how populations differ, and what normal variation looks like. This is called Self-Supervised Learning.
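The fill-in-the-blank game can be sketched with a toy haplotype matrix. Everything here is an illustrative stand-in: the 75% ratio echoes the analogy above (the paper's exact masking scheme is an assumption), and the majority-vote "model" is just a baseline to show what the reconstruction objective looks like.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy haplotype matrix: rows = chromosome copies (haplotypes),
# columns = variable sites (SNPs); 0/1 encode the two alleles.
haplotypes = rng.integers(0, 2, size=(8, 20))

# Hide 75% of entries at random.
mask = rng.random(haplotypes.shape) < 0.75
masked = np.where(mask, -1, haplotypes)  # -1 = "hidden, reconstruct me"

# The pre-training objective: predict haplotypes[mask] from `masked`.
# As a stand-in for the model, guess each SNP's majority allele among
# the entries that stayed visible in that column.
visible = ~mask
derived = np.where(visible, haplotypes, 0).sum(axis=0)
seen = visible.sum(axis=0)
majority = np.where(seen > 0, derived * 2 > seen, 0).astype(int)
guesses = np.broadcast_to(majority, haplotypes.shape)
accuracy = (guesses[mask] == haplotypes[mask]).mean()
print(f"baseline fill-in accuracy: {accuracy:.2f}")
```

A real model replaces the majority vote with a network that also uses context along the sequence, which is exactly the "grammar" the pre-training forces it to learn.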
2. The "Super-Reader" Architecture (The Transformer)
Most AI models look at DNA like a long, flat line of text. Popformer is different; it uses a Transformer architecture (the same tech behind tools like ChatGPT).
- The Analogy: Imagine a detective looking at a crime scene.
- A normal detective looks at one clue at a time.
- Popformer looks at the entire room at once. It can see how a clue in the corner relates to a clue on the ceiling, and how a clue in one person's DNA relates to a clue in another person's DNA.
- The Tech: It uses "Axial Attention." It looks across the positions along the genome (SNPs) and across the different people's chromosome copies (haplotypes) simultaneously. This allows it to spot complex patterns that other models miss.
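Axial attention can be sketched in a few lines: attend along one axis of the data grid, then along the other. This toy version uses a single head with identity projections (an assumption for brevity; the real model learns the projections and uses multiple heads).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x):
    # Scaled dot-product self-attention along the second-to-last axis.
    # Identity projections keep the sketch short.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
# Toy embedded input: (haplotypes, SNPs, embedding_dim).
x = rng.normal(size=(8, 20, 16))

# Axial attention, step 1: each haplotype attends across its SNPs.
across_snps = attention(x)                       # (8, 20, 16)

# Step 2: swap axes so each SNP column attends across haplotypes.
xt = np.swapaxes(across_snps, 0, 1)              # (20, 8, 16)
across_haps = np.swapaxes(attention(xt), 0, 1)   # back to (8, 20, 16)

print(across_haps.shape)
```

The design point: attending along each axis separately is far cheaper than full attention over all SNP-haplotype pairs, while still letting information flow between any two cells of the grid after the two steps.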
3. The "Specialist" Training (Fine-Tuning)
Once Popformer was a master of reading DNA, the authors gave it a specific job: Find the Superpowers.
- They showed it simulated examples of "superpowers" (selection) and "no superpowers" (neutral).
- Because Popformer already understood the "grammar" of DNA from the first step, it only needed a little bit of extra training to become an expert detective.
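A minimal sketch of that fine-tuning step, assuming a frozen pretrained encoder plus a small classification head. The pooling function, labels, and data below are toy stand-ins (the "sweep-like" labeling rule is invented for illustration), not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the frozen pretrained encoder: pool each genotype
# window (haplotypes x SNPs) into per-SNP allele frequencies.
# This pooling is a placeholder, not Popformer's actual encoder.
def pretrained_features(window):
    return window.mean(axis=0)

# Simulated windows with labels: 1 = "selection", 0 = "neutral".
# The label here is a toy rule (high derived-allele frequency at a
# focal SNP, loosely mimicking a sweep) so the head has signal to learn.
windows = rng.integers(0, 2, size=(200, 8, 20)).astype(float)
labels = (windows[:, :, 10].mean(axis=1) > 0.5).astype(int)

feats = np.stack([pretrained_features(w) for w in windows])  # (200, 20)

# Fine-tuning reduces to fitting a small logistic-regression head on
# top of the frozen features -- far cheaper than training from scratch.
w, b, lr = np.zeros(feats.shape[1]), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= lr * feats.T @ (p - labels) / len(labels)
    b -= lr * (p - labels).mean()

acc = (((feats @ w + b) > 0).astype(int) == labels).mean()
print(f"head accuracy on training windows: {acc:.2f}")
```

The head converges quickly because the features already encode the relevant structure; that is the "only needed a little extra training" point in miniature.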
Why This is a Big Deal
The paper tested Popformer in three ways:
1. The Simulation Test: They tested it on fake data it had never seen before (different populations, different demographic histories).
- Result: Popformer was much better at guessing the right answer than the old methods. It didn't get confused when the "rules" of the simulation changed.
2. The Real-World Test: They applied it to real human data (from the 1000 Genomes Project).
- Result: It successfully found known superpowers (like lactase persistence, the ability to digest milk in adulthood, in Europeans) that other AI models missed.
3. The "Generalization" Test: This is the most important part. They trained the AI on European data but tested it on African and Asian data.
- Result: Most AI models failed here because they had memorized European-specific patterns. Popformer, having learned the general rules of DNA first, could adapt and find superpowers in the other groups too.
The Takeaway
Think of previous AI models as students who memorized the answers to a specific practice test. If the real test has different questions, they fail.
Popformer is like a student who first learned the principles of the subject (by reading real textbooks) and then took a practice test. Because it understands the underlying rules, it can solve problems it has never seen before.
In short: The authors created an AI that learns the "language" of human evolution from real data first, making it a much more robust and accurate detective for finding how humans adapted to their environments. This opens the door to finding new evolutionary secrets in any human population, not just the ones represented in the training simulations.