Joint likelihood-free inference of the number of selected single nucleotide polymorphisms and the selection coefficient in an evolving population

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery in a bustling city (a population of organisms). The city is changing over time: some people are moving faster, others are staying put, and the crowd is shifting. Your job is to figure out why the crowd is moving. Is it because of a single, charismatic leader (one strong mutation) pulling everyone along? Or is it because a whole group of people decided to move together (multiple mutations working in tandem)?

This paper presents a new detective tool to solve that mystery, specifically for scientists studying how populations evolve in the lab.

Here is the breakdown of the paper in simple terms:

1. The Problem: The "Black Box" of Evolution

In the past, scientists tried to figure out evolution by looking at the math behind it. But the math for how genes change over time is incredibly complex—like trying to predict the exact path of every single raindrop in a storm. It's too messy to calculate directly.

Because the math is too hard, scientists usually use a "shortcut." They look at the data and guess what caused it. However, most of these shortcuts have a blind spot: they assume only one person (one gene) is doing the leading. They miss the possibility that a whole team is working together. If a team is working together, the old tools get confused and might think one person is super powerful when, in reality, the power is shared among many.

2. The Solution: The "Simulation Game"

The authors propose a new method called Likelihood-Free Inference. Think of this as playing a massive game of "Guess Who?" using a video game simulator.

Instead of trying to solve the impossible math equation, they do this:

1. Make up a story: They guess, "Maybe 2 people are leading, and they are moving at this speed."
2. Run the simulation: They use a computer to simulate a whole population evolving based on that story.
3. Compare: They look at the simulated crowd and compare it to the real crowd they observed in the lab.
4. Repeat: They do this thousands of times, changing the story (how many leaders? how fast are they moving?) until they find the story that looks most like the real data.

This is called Approximate Bayesian Computation (ABC). It's like trying to find the right key for a lock by trying thousands of keys until one fits, rather than trying to pick the lock with a math formula.

3. The New Twist: Counting the "Leaders"

The real innovation here is what they are counting.

Old methods: "There is a leader! Let me guess how fast they are running." (They assume there is only 1 leader).
This paper's method: "Let's count how many leaders there are first, then guess how fast they are running."

They can tell whether the crowd is moving because of one strong gene or because of two genes working together (the analyses in the paper consider zero, one, or two selected genes per genomic window). This matters because adaptation is often driven by several changes acting at once, not just one big change.

4. The "Energy Score" (The Ruler)

How do they decide if their simulated crowd looks like the real one?
Usually, scientists measure the distance between two things (like how far apart two points are on a map). But here, the data is complex—it's a whole pattern of movement over time.

The authors use a fancy metric called the Expected Energy Score.

Analogy: Imagine you have a bag of marbles (your real data) and you want to see if a bag of marbles you made up (your simulation) feels the same.
Instead of just measuring the distance between one marble and another, you look at the overall "vibe" or "shape" of the whole bag. Does the weight distribution feel right? Does the spread look similar?
This "Energy Score" helps them compare the entire pattern of evolution, not just single points, making the comparison much more accurate.

5. The Test Drive: Yeast and Fruit Flies

The authors tested their new detective tool in two ways:

Fake Data: They created computer simulations where they knew the answer (e.g., "We made 2 leaders"). They ran their tool and asked, "Did you find 2 leaders?" The tool got it right most of the time, especially when the leaders were moving fast.
Real Data: They looked at a real experiment with yeast (tiny fungi) that were evolving in a lab.
- When they looked at all the yeast samples together, the tool said, "Nothing is happening."
- But when they looked closer, they realized that only 2 out of 12 yeast groups were actually evolving strongly. The other 10 were just drifting randomly.
- Once they focused on just those 2 active groups, the tool successfully identified that two genes were working together to help the yeast adapt.

The Big Takeaway

This paper gives scientists a better magnifying glass. Instead of just seeing "something is changing," they can now see how many things are changing and how strong those changes are.

It's like moving from a blurry photo where you just see a crowd moving, to a high-definition video where you can count exactly how many people are leading the charge and how fast they are running. This helps us understand the "architecture" of evolution—whether it's a solo act or a team effort.

1. Problem Statement

In population genetics, detecting natural selection from time-series genomic data (e.g., evolve-and-resequence experiments) is challenging due to the intractability of exact likelihood functions caused by complex genealogical histories and linkage disequilibrium.

Limitations of Current Methods: Existing approaches typically infer selection strength at the level of individual Single Nucleotide Polymorphisms (SNPs). They often assume a single selected site per genomic region. This assumption ignores the biological reality that adaptation can result from the combined action of multiple linked loci.
The Core Challenge: When multiple linked loci are under selection, standard SNP-by-SNP methods can produce misleading results, such as overestimating the effect size of a single variant or failing to distinguish between a single strong signal and multiple weaker signals. Furthermore, these methods struggle to quantify the "selective architecture" (i.e., how many loci are actually driving the adaptation).
Goal: The authors aim to develop a framework that jointly infers:
1. The number of selected loci ( $n_{sel}$ ) within a genomic window.
2. The corresponding selection coefficients ( $s_1, \dots, s_{n_{sel}}$ ).
3. The uncertainty of these estimates via a posterior distribution.

2. Methodology

The authors propose a Likelihood-Free Inference (LFI) framework based on Approximate Bayesian Computation (ABC), specifically utilizing the Population Monte Carlo ABC (PMC-ABC) algorithm.

A. Simulation Model

Underlying Model: A discrete-time Wright-Fisher model with selection.
Implementation: Simulations are performed using the MimicrEE2 software.
Scenarios: The model supports both haploid and diploid organisms, with options for varying effective population sizes ( $N_e$ ), recombination rates, and selection strengths.
Data Structure: The simulator generates time-series allele frequency data for multiple replicate populations over $K$ generations.
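
To make the simulation model concrete, here is a minimal sketch of a single-locus haploid Wright-Fisher trajectory with selection. It is not MimicrEE2 and ignores linkage, recombination, and multiple loci. The function name `wright_fisher_haploid`, the relative fitness of $1 + s$, and the parameter values are assumptions chosen for illustration.

```python
import random

def wright_fisher_haploid(p0, s, n_e, generations, rng):
    """Return allele frequencies p_0..p_K for one replicate population.

    Assumption: the selected allele has relative fitness 1 + s and the
    population size stays fixed at n_e haploid individuals.
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Selection step: expected frequency after fitness weighting.
        p_sel = p * (1 + s) / (p * (1 + s) + (1 - p))
        # Drift step: binomial sampling of n_e offspring.
        count = sum(1 for _ in range(n_e) if rng.random() < p_sel)
        p = count / n_e
        freqs.append(p)
    return freqs

rng = random.Random(0)
# Illustrative values only: start at 10%, s = 0.1, N_e = 500, K = 60.
trajectory = wright_fisher_haploid(p0=0.1, s=0.1, n_e=500, generations=60, rng=rng)
```

Running the function several times with different seeds would give the replicate trajectories that the summary statistics are computed from.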

B. Summary Statistics

Instead of using raw allele frequencies, the authors derive specific summary statistics to capture selection signals efficiently:

Logit-Transformed Estimates: For each SNP and time interval, they compute an approximate selection coefficient ( $s$ ) from the slope of the logit-transformed allele frequencies over time (based on Taus et al., 2017).
- Haploid: $\ln(\frac{p_t}{1-p_t}) = \ln(\frac{p_0}{1-p_0}) + st$
- Diploid: $\ln(\frac{p_t}{1-p_t}) = \ln(\frac{p_0}{1-p_0}) + \frac{s}{2}t$
Vector Construction: These estimates are computed for all SNPs in a window and sorted in ascending order. This creates a high-dimensional vector that preserves both temporal and genomic structure.
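
The logit relations above can be rearranged to give a per-SNP estimate of $s$ for one time interval, and the estimates can then be sorted into a vector. The sketch below shows one way to do this. The helper names, the clipping constant `eps` (used to avoid taking the log of 0 for lost or fixed alleles), and the example frequencies are assumptions, not details taken from the paper.

```python
import math

def logit(p, eps=1e-3):
    # Clip to (eps, 1 - eps) so lost or fixed alleles stay finite (assumption).
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def estimate_s(p0, pt, t, diploid=False):
    """Solve the logit relation for s over an interval of t generations."""
    slope = (logit(pt) - logit(p0)) / t
    # Haploid: slope = s. Diploid: slope = s / 2, so s = 2 * slope.
    return 2 * slope if diploid else slope

def summary_vector(freq_pairs, t, diploid=False):
    """Estimate s for every SNP in a window and sort in ascending order."""
    return sorted(estimate_s(p0, pt, t, diploid) for p0, pt in freq_pairs)

# Hypothetical (p_0, p_t) pairs for three SNPs observed 10 generations apart.
pairs = [(0.10, 0.30), (0.50, 0.52), (0.20, 0.15)]
vector = summary_vector(pairs, t=10)
```

A SNP whose frequency falls over the interval gets a negative estimate, so it lands at the start of the sorted vector.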

C. Distance Metric (The Novelty)

A critical innovation is the choice of distance function to compare simulated and observed data.

Challenge: Each replicate population yields its own summary-statistic vector, so the observed and simulated data are each a sample of vectors rather than a single point. A standard Euclidean distance between two summary vectors is therefore insufficient.
Solution: The authors use the Expected Energy Score (EES) as a distance metric.
- EES is a strictly proper scoring rule that measures the distance between two probability distributions (the distribution of summary statistics from observed replicates vs. simulated replicates).
- It effectively captures the "shape" and variability of the selection signals across replicates, making it robust to noise and capable of distinguishing between different numbers of selected loci.
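
As a rough illustration of how such a score can be computed from samples, the sketch below uses the standard sample estimator of the energy score, $E\|X - y\| - \tfrac{1}{2} E\|X - X'\|$, averaged over the observed replicates. The exact form used in the paper (for example the exponent on the norm or the normalisation of the within-sample term) may differ, and the function names are assumptions.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def energy_score(sim, y):
    """Sample energy score of simulated vectors `sim` against one observation y."""
    m = len(sim)
    # Average distance from the simulated vectors to the observation.
    to_obs = sum(euclid(x, y) for x in sim) / m
    # Average pairwise distance within the simulated sample (includes i == j).
    within = sum(euclid(a, b) for a in sim for b in sim) / (m * m)
    return to_obs - 0.5 * within

def expected_energy_score(sim, obs):
    """Average the energy score over all observed replicate vectors."""
    return sum(energy_score(sim, y) for y in obs) / len(obs)

# Toy one-dimensional example: lower scores mean a closer match.
obs = [[1.0]]
close = expected_energy_score([[0.0], [2.0]], obs)
far = expected_energy_score([[10.0], [12.0]], obs)
```

In this toy case the simulated sample centred on the observation gets the lower score, which is the behaviour an ABC distance needs.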

D. Inference Algorithm

Algorithm: PMC-ABC (Beaumont, 2010b).
Process:
1. Draw parameters ( $\theta = \{n_{sel}, s_1, s_2\}$ ) from priors.
2. Simulate data.
3. Compute the EES distance between simulated and observed summary statistics.
4. Accept parameters if the distance is below a threshold $\epsilon$ .
5. Iteratively refine the parameter distribution by decreasing $\epsilon$ and using weighted resampling.
Priors:
- $n_{sel} \sim \text{Discrete Uniform}(\{0, 1, 2\})$ .
- $s_i \sim \text{Uniform}(0, 0.2)$ .
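
Steps 1 to 4 can be sketched as plain rejection ABC. This is a simplification: it omits the iterative refinement of step 5 (shrinking $\epsilon$ with weighted resampling) that makes the algorithm PMC-ABC. The `toy_simulator` below is a hypothetical stand-in for a forward simulator and is not a Wright-Fisher model; it just adds Gaussian noise to the true effects. Keeping the closest fraction of draws is equivalent to setting $\epsilon$ to a quantile of the distances.

```python
import math
import random

def toy_simulator(n_sel, s_values, n_reps, n_snps, rng, noise=0.01):
    """Hypothetical stand-in: one sorted vector of noisy per-SNP estimates per replicate."""
    reps = []
    for _ in range(n_reps):
        effects = list(s_values[:n_sel]) + [0.0] * (n_snps - n_sel)
        reps.append(sorted(e + rng.gauss(0.0, noise) for e in effects))
    return reps

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def expected_energy_score(sim, obs):
    cross = sum(euclid(x, y) for x in sim for y in obs) / (len(sim) * len(obs))
    within = sum(euclid(a, b) for a in sim for b in sim) / (len(sim) ** 2)
    return cross - 0.5 * within

def rejection_abc(obs, n_draws, keep_fraction, n_snps, rng):
    draws = []
    for _ in range(n_draws):
        # Step 1: draw from the priors n_sel ~ U{0, 1, 2}, s_i ~ U(0, 0.2).
        n_sel = rng.choice([0, 1, 2])
        s_values = [rng.uniform(0.0, 0.2) for _ in range(n_sel)]
        # Step 2: simulate as many replicates as were observed.
        sim = toy_simulator(n_sel, s_values, len(obs), n_snps, rng)
        # Step 3: distance between simulated and observed summaries.
        draws.append((expected_energy_score(sim, obs), n_sel, s_values))
    # Step 4: accept the draws with the smallest distances.
    draws.sort(key=lambda d: d[0])
    return draws[: max(1, int(keep_fraction * n_draws))]

rng = random.Random(1)
observed = toy_simulator(2, [0.10, 0.15], n_reps=5, n_snps=10, rng=rng)
accepted = rejection_abc(observed, n_draws=2000, keep_fraction=0.02, n_snps=10, rng=rng)
# Approximate posterior over the number of selected loci.
posterior_n_sel = {k: sum(1 for _, n, _ in accepted if n == k) / len(accepted)
                   for k in (0, 1, 2)}
```

The fraction of accepted draws with each value of `n_sel` approximates the posterior probability of that number of selected loci, and the accepted `s_values` approximate the posterior over the coefficients.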

3. Key Contributions

Joint Inference of Architecture: Unlike previous methods that infer selection at a single SNP level, this approach explicitly estimates the number of selected loci within a window, providing insight into the polygenic nature of adaptation.
Novel Distance Metric: The application of the Expected Energy Score to ABC in population genetics allows for the comparison of distributions of high-dimensional summary statistics, handling the complexity of replicate data effectively.
Uncertainty Quantification: The Bayesian framework provides full posterior distributions, allowing researchers to quantify confidence in the estimated number of loci and their coefficients.
Handling Linkage: By modeling the joint distribution of selection signals, the method mitigates biases caused by linkage disequilibrium that plague single-SNP approaches.

4. Results

A. Simulation Studies

The method was tested on 20 simulated datasets under varying conditions (haploid or diploid, $N_e = 500$ or $N_e = 1000$, different selection strengths, and replicate numbers).

Accuracy in Haploids: The method reliably detected the correct number of selected loci ( $n_{sel}$ ) when selection was strong ( $s \ge 0.05$ ). Accuracy improved with higher $N_e$ and more replicates.
Accuracy in Diploids: Due to the additive model, selection coefficients needed to be roughly double those in haploids to achieve the same detection power. The method performed well when selection was sufficiently strong.
Recombination: The method remained robust across varying recombination rates (0 to 49 cM/Mb), though higher recombination slightly improved the ability to distinguish multiple loci.
Replicate Sensitivity: While 20 replicates provided the best performance, the method could still distinguish between 0, 1, and 2 selected loci with as few as 10 replicates. However, 5 replicates were insufficient for distinguishing between 1 and 2 loci.
Coefficient Estimation: When the correct model ( $n_{sel}$ ) was selected, the estimated selection coefficients were accurate. If the model was misspecified (e.g., inferring 1 locus when 2 existed), the coefficient estimates were biased.
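
The factor of two noted for diploids follows directly from the logit relations used for the summary statistics. Writing $\Delta = \ln(\frac{p_t}{1-p_t}) - \ln(\frac{p_0}{1-p_0})$ for the change in logit frequency over $t$ generations:

$$
\Delta_{\text{hap}} = s_{\text{hap}}\, t, \qquad \Delta_{\text{dip}} = \frac{s_{\text{dip}}}{2}\, t, \qquad \Delta_{\text{hap}} = \Delta_{\text{dip}} \iff s_{\text{dip}} = 2\, s_{\text{hap}}
$$

As an illustration with assumed values, a haploid coefficient of $s_{\text{hap}} = 0.05$ over $t = 60$ generations gives $\Delta = 3$, and a diploid population needs $s_{\text{dip}} = 0.10$ to produce the same shift.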

B. Real Data Application (Yeast)

The method was applied to an outcrossing yeast dataset (Burke et al., 2014) involving 12 replicate populations over 540 generations.

Initial Analysis (All 12 Replicates): When analyzing all 12 replicates, the posterior probability strongly favored $n_{sel}=0$ (no selection) across all windows. This aligned with previous studies suggesting weak selection ( $s \approx 0.009$ ).
Re-analysis (Informative Replicates): The authors identified that only 2 of the 12 replicates showed strong, consistent selection signals (likely due to genetic redundancy where different replicates took different evolutionary paths).
Result with 2 Replicates: When the analysis was restricted to these 2 informative replicates:
- The method detected significant selection in the first 4 genomic windows.
- It inferred $n_{sel}=2$ for the first three windows and $n_{sel}=1$ for the fourth.
- Estimated selection coefficients were in the range of $0.012$ to $0.115$, providing a more granular view of the adaptive landscape than previous single-SNP methods.

5. Significance and Conclusion

Paradigm Shift: This work moves beyond the "single-SNP" paradigm in experimental evolution, offering a tool to characterize the genomic architecture of selection (i.e., is adaptation driven by one major gene or several linked genes?).
Robustness: The use of EES makes the method robust to the stochasticity inherent in small replicate numbers and complex linkage structures.
Practical Utility: The ability to quantify uncertainty and distinguish between different selective architectures provides evolutionary biologists with a more nuanced understanding of adaptation mechanisms.
Scalability: The authors demonstrate that the method is computationally feasible on high-performance computing clusters, suggesting it can be scaled to genome-wide studies.

In summary, the paper presents a sophisticated ABC framework that successfully addresses the "curse of dimensionality" and linkage issues in population genetics, enabling the joint inference of selection strength and the number of selected loci with quantifiable uncertainty.