Bias in diversity estimators and neutrality tests induced by neutral polymorphic structural variants

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to figure out the history of a bustling city (your genome) by counting how many people have different colored hats (genetic variations).

Usually, you have a "standard rulebook" that tells you what the hat distribution should look like if everyone is just living their lives normally, without any special events. If the hat colors are distributed in a weird way, you might conclude, "Aha! Something dramatic happened here—maybe a war, a famine, or a massive party!"

The Problem: The "Structural Variant" Trap

This paper is about a specific trap that tricks these detectives. Sometimes, a whole neighborhood in the city gets a giant, invisible fence around it (a Structural Variant, or SV). This fence could be a missing block (Deletion), a new block added (Insertion), a street that got flipped around (Inversion), or a whole new neighborhood moved in from another city (Introgression).

The problem is that everyone inside this fenced neighborhood is stuck together. They all share the same history because of the fence. If you try to apply your standard "hat color rulebook" to this neighborhood, you will get the wrong answer. You might think a massive party happened, when in reality, it was just the fence distorting the view.

The Analogy: The Two-Party Dinner

Let's use a dinner party analogy to explain what the authors discovered:

The Standard Party (Neutral Evolution): Imagine a big dinner where everyone sits randomly. You count how many people are wearing red hats vs. blue hats. The distribution is predictable.
The Fenced Table (The SV): Now, imagine a large table is fenced off.
- Inversions: The people at this table are wearing hats, but the table is upside down. You can still see the hats, but the way they are arranged is weird because the whole table is flipped.
- Deletions: The fence is so high that half the people at the table are invisible. You only see the hats of the people who are still visible. It looks like there are fewer hats than there should be.
- Insertions: A new group of people has been added to the table, but they are all wearing brand new, unique hats that no one else has. It looks like there are too many rare hats.
- Introgressions: A group of people from a completely different city (a different species or population) joins the table. They have a different style of hat entirely. When you mix them with the locals, the hat distribution looks chaotic.

What the Paper Found

The authors (Ramos-Onsins, Ross-Ibarra, et al.) did the math to show exactly how these "fences" mess up the statistics.

The Bias: If you ignore the fence and just count hats, your diversity estimates and "Neutrality Tests" (like Tajima's D) can point to "Evolutionary Drama!" when there is actually none.
- Example: If you have a Deletion (missing people) or an Insertion (new people), hats can only be counted on part of the table, so the math underestimates how varied the city is. You might also expect the neutrality tests to report a bottleneck or an expansion, but the authors find that for these two cases the tests stay roughly centered.
- Example: If you have an Inversion or Introgression at intermediate frequency, the math might think there was strong natural selection, because the hats are clumped in the middle frequencies.

The Solution: The "Fence-Aware" Calculator

The paper doesn't just point out the problem; it builds a new calculator.

Instead of using the standard rulebook, the authors created a custom rulebook for each type of fence.

If you know there is a Deletion, the calculator adjusts the numbers to account for the missing people.
If you know there is an Introgression, the calculator accounts for the "foreign" hats.

Why This Matters

In the past, scientists might have looked at a genome region with a structural variant and wrongly concluded that "Natural Selection is acting here!" or "The population went through a crash!"

This paper says: "Wait a minute. Before you call the press, check if there's a fence."

By using their new formulas, scientists can now look at the data, see the fence, and say, "Okay, the hat distribution looks weird, but that's just because of the fence. The population is actually doing just fine."

In a Nutshell

This paper is a guidebook for genetic detectives. It teaches them that Structural Variants (big chunks of DNA that are flipped, missing, added, or imported) act like optical illusions. They make the genetic data look like it's telling a dramatic story of selection or population change, when it's actually just a neutral trick of geometry. The authors provide the mathematical tools to see through the illusion and get the true story.

1. Problem Statement

Genetic diversity estimators (e.g., Watterson's $\theta_W$, nucleotide diversity $\pi$) and neutrality tests (e.g., Tajima's $D$, Fay and Wu's $H$) are fundamental tools in population genetics. They rely on the Site Frequency Spectrum (SFS) and assume a standard neutral model as a baseline. Under this model, the expected SFS follows a $1/k$ distribution (where $k$ is the derived allele count).
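As a quick check of the $1/k$ baseline, the sketch below builds the standard neutral expectation $E[\xi_k] = \theta/k$ and confirms that both estimators recover $\theta$ from it. The function names and the values of `n` and `theta` are illustrative choices, not taken from the paper.

```python
from math import comb


def neutral_sfs(n, theta):
    """Expected unfolded SFS under the standard neutral model: E[xi_k] = theta / k."""
    return {k: theta / k for k in range(1, n)}


def watterson_theta(sfs, n):
    """theta_W = S / a_n, with S the number of segregating sites and a_n = sum_{j<n} 1/j."""
    a_n = sum(1.0 / j for j in range(1, n))
    return sum(sfs.values()) / a_n


def nucleotide_diversity(sfs, n):
    """pi = mean number of pairwise differences = sum_k k (n - k) xi_k / C(n, 2)."""
    return sum(k * (n - k) * x for k, x in sfs.items()) / comb(n, 2)


n, theta = 10, 5.0  # illustrative sample size and mutation rate
sfs = neutral_sfs(n, theta)
# Both estimators return theta (up to floating-point rounding) on the 1/k baseline.
print(watterson_theta(sfs, n), nucleotide_diversity(sfs, n))
```

Any spectrum that departs from the $1/k$ shape makes the two estimators disagree, which is the signal that neutrality tests are built on.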

However, Polymorphic Structural Variants (SVs)—such as inversions, deletions, insertions, and introgressions—violate this assumption. When a genomic region is completely linked to a segregating SV, the evolutionary history of neutral mutations in that region is conditioned on the SV's allele. Even if the SV and linked mutations are evolving neutrally (without selection), the presence of the SV distorts the SFS. This leads to systematic biases in standard estimators and tests, potentially causing researchers to falsely infer selection or demographic changes. While this bias is well-understood for inversions under balancing selection, the specific biases induced by neutral SVs of various types (deletions, insertions, introgressions) have not been analytically derived.

2. Methodology

The authors developed an analytical framework to derive the exact conditional expectations of the SFS for neutral mutations linked to a biallelic SV.

Modeling Assumptions:
- Complete Linkage: All analyzed sites are perfectly associated with the SV allele (no recombination between the SV and the SNPs).
- Conditioning: The analysis conditions on the sample count ($i$) of the SV allele within a sample of size $n$.
- SV Types: The study covers four specific biallelic SV types:
  1. Inversions: Both arrangements contain the sequence; mutations can occur on either background.
  2. Deletions: The sequence is absent in the derived allele; mutations only occur on the ancestral background.
  3. Insertions: The sequence is unique to the derived allele; mutations only occur on the derived background.
  4. Introgressions: The derived allele originates from a diverged population, introducing deeper genealogical branches.
Analytical Approach:
- The authors decompose the SFS into five distinct spectral components based on the relationship between a mutation and the SV allele:
  1. Strictly Nested (sn): The mutation is carried by a strict subset of the sequences carrying the SV allele.
  2. Co-occurring (co): The mutation is carried by exactly the sequences carrying the SV allele.
  3. Enclosing (en): The sequences carrying the SV allele are a strict subset of those carrying the mutation.
  4. Complementary (cm): The mutation is carried by exactly the sequences that lack the SV allele.
  5. Strictly Disjoint (sd): The mutation is carried by a strict subset of the sequences that lack the SV allele.
- They derived explicit formulas for the expected number of mutations ($E[\xi_k|i]$) for each component and SV type.
- For cases where ancestral states are unknown (e.g., insertions/introgressions), they used the folded SFS (minor allele frequency).
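The five components can be read as set relationships between the sequences carrying a derived mutation and those carrying the SV allele. The sketch below follows that reading of the definitions above. `classify_mutation`, `component_spectra`, and the toy sample are hypothetical names and data for illustration, not the authors' code.

```python
from collections import Counter


def classify_mutation(mutation_carriers, sv_carriers, n):
    """Return the component label (sn, co, en, cm, sd) for one derived mutation.

    Carriers are sets of sample indices in range(n). The SV is assumed to be
    segregating and the mutation to be present in at least one sequence.
    """
    m, v = set(mutation_carriers), set(sv_carriers)
    background = set(range(n)) - v  # sequences lacking the SV allele
    if not m or not v or not background:
        raise ValueError("the mutation must be present and the SV must be segregating")
    if m == v:
        return "co"
    if m < v:
        return "sn"
    if v < m:
        return "en"
    if m == background:
        return "cm"
    if m < background:
        return "sd"
    # Partial overlap cannot arise on a single genealogy without recombination
    # or recurrent mutation, so it violates the complete-linkage assumption.
    return "incompatible"


def component_spectra(mutations, sv_carriers, n):
    """Count mutations by (component, derived allele count k)."""
    return Counter(
        (classify_mutation(m, sv_carriers, n), len(set(m))) for m in mutations
    )


# Toy sample of n = 6 sequences; the SV allele is carried by sequences 0, 1, 2.
sv = {0, 1, 2}
muts = [{0, 1}, {0, 1, 2}, {0, 1, 2, 3}, {3, 4, 5}, {4}]
spectra = component_spectra(muts, sv, n=6)
print(spectra)
```

Returning an explicit "incompatible" label, rather than forcing every site into a component, makes it easy to spot sites that break the complete-linkage assumption in real data.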
Bias Quantification:
- Bias was calculated as the deviation between the SV-conditional expectation and the standard neutral baseline:
  $\text{Bias}(T|i) = E_{SV}[T|i] - E_0[T]$
- This was applied to estimators ($\theta_W, \pi$) and tests ($D, H$).
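Because $\theta_W$ and $\pi$ are linear in the SFS, their bias follows directly from any expected spectrum. The sketch below shows that calculation. The paper's SV-conditional formulas are not reproduced here, so `toy_sv` is an arbitrary placeholder spectrum (an excess of intermediate-frequency classes), and all function names are assumptions made for illustration.

```python
from math import comb


def expected_estimators(expected_sfs, n):
    """Expected theta_W and pi given an expected unfolded SFS {k: E[xi_k]}.

    Both estimators are linear in the SFS, so their expectations follow
    directly from the expected spectrum.
    """
    a_n = sum(1.0 / j for j in range(1, n))
    theta_w = sum(expected_sfs.values()) / a_n
    pi = sum(k * (n - k) * x for k, x in expected_sfs.items()) / comb(n, 2)
    return {"theta_W": theta_w, "pi": pi}


def estimator_bias(expected_sfs_sv, n, theta):
    """Bias(T | i) = E_SV[T | i] - E_0[T], where E_0[T] = theta for both estimators."""
    conditional = expected_estimators(expected_sfs_sv, n)
    return {name: value - theta for name, value in conditional.items()}


n, theta = 10, 5.0  # illustrative values
neutral = {k: theta / k for k in range(1, n)}
# Arbitrary placeholder, NOT a formula from the paper: classes k = 4..6 are doubled.
toy_sv = {k: (2.0 if 4 <= k <= 6 else 1.0) * theta / k for k in range(1, n)}

print(estimator_bias(neutral, n, theta))  # both biases vanish on the baseline
print(estimator_bias(toy_sv, n, theta))   # pi is inflated more than theta_W
```

The tests $D$ and $H$ are ratios rather than linear functions of the SFS, so their bias needs the variance terms as well and is not covered by this shortcut.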

3. Key Contributions

Analytical Derivation: The paper provides the first exact analytical expectations for the unfolded and folded SFS of neutral mutations linked to four distinct types of neutral SVs.
Decomposition of Spectral Components: It clarifies how different SV architectures (presence/absence of sequence, divergence) alter the genealogical structure and resulting SFS components.
Quantification of Bias: It quantifies how SV frequency and type systematically bias standard diversity metrics and neutrality tests under strict neutrality.
Correction Framework: It proposes a mathematical framework to build SV-aware estimators and centered neutrality tests that account for the SV's presence and frequency, effectively removing the bias.

4. Key Results

A. Impact on Diversity Estimators ($\theta_W$ and $\pi$)

Inversions & Introgressions: Intermediate to high-frequency SVs lead to a significant overestimation of genetic diversity compared to the neutral baseline. This is driven by the accumulation of mutations on the distinct SV background and, in the case of introgressions, fixed differences between the ancestral and introgressed lineages.
Deletions & Insertions: These result in a decrease in estimated variability.
- Deletions: Mutations are restricted to the ancestral background (fewer lineages), leading to lower diversity estimates, especially at high SV frequencies.
- Insertions: Mutations are restricted to the derived background. Even though estimators account for reduced sequence count, the specific spectral shape leads to underestimation relative to the standard baseline.

B. Impact on Neutrality Tests (Tajima's $D$ and Fay & Wu's $H$)

Inversions & Introgressions:
- Intermediate Frequency: Characterized by an excess of intermediate-frequency mutations, leading to positive values for both $D$ and $H$.
- High Frequency: Characterized by an excess of rare ancestral alleles, pushing $D$ and $H$ toward negative values.
Deletions & Insertions:
- Intuitively, one might expect deletions to mimic population contraction (deficit of rare alleles, $D > 0$) and insertions to mimic expansion (excess of rare alleles, $D < 0$).
- Counter-intuitive Finding: While the shape of the spectrum changes, the expected values of Tajima's $D$ and Fay & Wu's $H$ remain approximately centered (near 0) for indels under neutrality. The biases in the numerator and denominator of these statistics tend to cancel out, making them robust to neutral indels, unlike $\theta_W$ and $\pi$.
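To make the sign conventions above concrete, the sketch below computes both statistics from an unfolded SFS. It uses the standard textbook definitions (Tajima's normalizing constants, and the unnormalized $H = \pi - \theta_H$). The spectra in the demo are made-up extremes, not results from the paper.

```python
from math import comb, sqrt


def tajimas_d(sfs, n):
    """Tajima's D from an unfolded SFS {k: xi_k}, using Tajima's (1989) constants."""
    s = sum(sfs.values())
    if s == 0:
        return float("nan")  # D is undefined without segregating sites
    a1 = sum(1.0 / j for j in range(1, n))
    a2 = sum(1.0 / j**2 for j in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    pi = sum(k * (n - k) * x for k, x in sfs.items()) / comb(n, 2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def fay_wu_h(sfs, n):
    """Unnormalized Fay and Wu's H = pi - theta_H, from an unfolded SFS."""
    pairs = comb(n, 2)
    pi = sum(k * (n - k) * x for k, x in sfs.items()) / pairs
    theta_h = sum(k * k * x for k, x in sfs.items()) / pairs
    return pi - theta_h


n = 10
neutral = {k: 5.0 / k for k in range(1, n)}  # 1/k baseline: D and H are centered
singletons_only = {1: 10}                    # excess of rare alleles: D < 0
high_freq_only = {n - 1: 10}                 # excess of high-frequency derived alleles: H < 0

for name, sfs in [("neutral", neutral), ("singletons", singletons_only),
                  ("high-frequency", high_freq_only)]:
    print(name, tajimas_d(sfs, n), fay_wu_h(sfs, n))
```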

C. Correction Strategies

The authors propose two methods to correct estimators for a known SV frequency $i$:

Spectrum Redefinition: Redefine the null expectation in the denominator of the estimator using the derived SV-conditional spectrum ($E_{SV}[\xi_k|i]$).
Renormalization: Rescale the standard estimator by a constant factor derived from the ratio of the SV-conditional expectation to the standard expectation.
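One possible reading of the two strategies, for estimators that are linear in the SFS, is sketched below. It assumes an estimator of the form $\sum_k w_k \xi_k / \sum_k w_k e_k$, where $e_k$ is the null expectation of $\xi_k$ per unit $\theta$. The name `linear_estimate` and the `sv_unit` spectrum are placeholders, since the paper's conditional formulas are not reproduced here, and the exact forms used by the authors may differ.

```python
def linear_estimate(sfs, weights, unit_spectrum):
    """Estimate theta as sum_k w_k xi_k / sum_k w_k e_k.

    unit_spectrum[k] = e_k is the null expectation of xi_k per unit theta
    (1/k for the standard neutral model).
    """
    numerator = sum(weights[k] * sfs.get(k, 0.0) for k in weights)
    denominator = sum(weights[k] * unit_spectrum[k] for k in weights)
    return numerator / denominator


n, theta = 10, 5.0  # illustrative values
standard_unit = {k: 1.0 / k for k in range(1, n)}
# Arbitrary placeholder for E_SV[xi_k | i] / theta, NOT a formula from the paper.
sv_unit = {k: (0.5 if k <= 3 else 1.0) / k for k in range(1, n)}
weights_pi = {k: float(k * (n - k)) for k in range(1, n)}  # weights that give pi

# Idealized data: the observed spectrum equals its SV-conditional expectation.
observed = {k: theta * sv_unit[k] for k in range(1, n)}

standard = linear_estimate(observed, weights_pi, standard_unit)  # biased pi

# 1. Spectrum redefinition: swap the null spectrum in the denominator.
redefined = linear_estimate(observed, weights_pi, sv_unit)

# 2. Renormalization: divide the standard estimate by the ratio of the
#    SV-conditional expectation to the standard expectation.
factor = linear_estimate(sv_unit, weights_pi, standard_unit)
renormalized = standard / factor

print(standard, redefined, renormalized)
```

Under this reading the two corrections coincide for a single linear estimator, and both return $\theta$ on the idealized data while the standard estimate falls short of it. Setting every weight to 1 gives the Watterson-style version.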

5. Significance and Implications

Re-evaluation of Selection Signals: Many genomic regions showing signatures of selection (e.g., extreme Tajima's $D$ or high diversity) may actually be neutral regions linked to polymorphic SVs. This study provides the necessary null models to distinguish true selection from SV-induced artifacts.
Improved Genomic Scans: By implementing SV-aware estimators, researchers can reduce false positives in scans for positive or balancing selection.
Theoretical Foundation: The work bridges the gap between structural variation and population genetic theory, offering a rigorous mathematical basis for interpreting SFS data in the presence of complex genomic architectures.

6. Limitations and Future Directions

The authors acknowledge several simplifying assumptions:

Complete Linkage: The model assumes no recombination between the SV and the SNPs. In reality, recombination will decay the bias with distance from the SV.
Simple SV Architectures: The model assumes biallelic, single-origin SVs. Multi-allelic SVs, nested SVs, and complex repeat structures are not covered.
Genotyping Accuracy: The model assumes perfect detection and genotyping of SVs, ignoring biases from mapping and alignment errors common in empirical data.
Demography: The baseline assumes a constant population size. Extending this to complex demographic histories (bottlenecks, expansions) increases formal complexity.
Selection: The current derivation is for neutral SVs; modeling SVs under selection requires joint modeling of SV frequency dynamics and linked variation.

In conclusion, this paper establishes that neutral SVs are a major source of bias in standard population genetic statistics and provides the analytical tools to correct for them, thereby improving the accuracy of evolutionary inference.
