Model selection in ADMIXTURE can be inconsistent: proof of the K=2 phenomenon

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery: Who are these people, and where do they come from?

You have a massive pile of genetic data (like thousands of DNA fingerprints) from different people. Your goal is to group them into distinct "ancestral tribes." But there's a catch: you don't know how many tribes actually exist. Is it 2? 3? 10?

To solve this, scientists use a popular computer program called ADMIXTURE (and its cousin, STRUCTURE). This program tries to guess the number of tribes ($K$) by looking for the "best fit."

The Problem: The "K=2" Trap

For years, scientists have noticed a weird glitch. When they use a specific rule to pick the best number of tribes, the computer very often screams, "It's 2!"

Even if there are clearly 3, 4, or 5 distinct groups in the data, the rule keeps forcing them into just two big buckets. This is frustrating because it hides the real, subtle history of the people being studied. It's like trying to sort a box of mixed fruits (apples, oranges, and bananas) and the scale keeps telling you, "There are only two types of fruit here: Red and Yellow," ignoring the oranges entirely.

The Paper's Discovery: Why the Scale is Broken

This paper, written by Dat Do and Jonathan Terhorst, acts like a mechanic opening the hood of that scale to explain why it's broken. They didn't just say, "Hey, it happens sometimes." They proved mathematically that under certain conditions, the rule is guaranteed to be wrong, even if you have infinite data.

Here is the explanation using a simple analogy:

The Analogy: The "Elbow" in the Graph

Imagine you are climbing a mountain, and you want to find the spot where the slope of the path changes most abruptly. The rule the scientists use (called Evanno's $\Delta K$) looks for a sharp "elbow" or bend in the graph of how well the model fits the data.

The Logic: The rule assumes that if you add a new tribe (go from 2 to 3), the fit should improve a lot. If you go from 3 to 4, the improvement should be smaller. The "elbow" is where the big jump stops and the small jumps begin.
The Flaw: The authors proved that if two of the tribes are too similar to each other (genetically close), the "jump" from 2 to 3 looks tiny. But the jump from 1 to 2 looks huge.
The Result: The rule sees the huge jump at the start and thinks, "Aha! That's the elbow! The answer is 2!" It completely misses the fact that there is a third, slightly different group hiding in the noise.

The "Cousin" Effect

Think of three families:

Family A (Lives in the mountains).
Family B (Lives in the valley).
Family C (Lives in the next valley over).

Family B and Family C are very close neighbors; they share a lot of DNA because they've been trading and marrying for centuries. Family A is a bit more distant.

The computer tries to group them.

If it guesses 2 groups, it might put B and C together as "Valley People" and A as "Mountain People." This is a decent guess.
If it guesses 3 groups, it tries to separate B and C. But because B and C are so similar, the computer struggles to find a clear line between them. The "improvement" in the guess is very small.

The rule ($\Delta K$) looks at the improvement. It sees a massive improvement when splitting A from the rest, but a tiny, almost invisible improvement when splitting B from C. So, it decides the "elbow" happened at 2 and stops there. It fails to see the third group.

The "Drift" Factor

The paper also explains when this happens using a concept called $F_{ST}$ (a measure of how different populations are).

High Difference: If the populations are very different (like humans vs. chimps), the rule works fine.
Low Difference: If the populations are closely related (like different European countries or distinct indigenous groups that split recently), the "drift" (genetic change over time) is small.

The authors proved that if the genetic drift between the "cousin" groups (B and C) is small enough compared to the distance to the "distant" group (A), the rule will fail and force the answer to be 2.

The Takeaway for Everyone

This isn't just a math problem; it has real-world consequences. If a conservation biologist is studying endangered species and uses this rule, they might conclude there are only two distinct populations when there are actually three. This could lead to bad decisions about how to protect them.

The authors' advice:
Don't blindly trust the computer's "best guess" number.

Look at the whole picture: Don't just pick the number the rule gives you. Look at the results for $K=2$, $K=3$, $K=4$, etc.
Use your brain: Combine the computer's math with what you know about biology and history.
Be skeptical: If the rule says "2," but you know the groups are complex, the rule might be falling into the "K=2 trap" because the groups are too similar for the math to handle easily.

In short: The computer is a powerful tool, but sometimes it gets lazy and picks the easiest answer (2) instead of the true, complicated answer. This paper explains exactly why that happens so we can stop trusting it blindly.

1. Problem Statement

The paper addresses a persistent and critical issue in population genetics: the selection of the number of ancestral populations ($K$) in model-based clustering algorithms like STRUCTURE and ADMIXTURE.

The Context: These models infer individual ancestry proportions by assuming data is generated from a mixture of $K$ latent populations. The choice of $K$ is non-trivial; underfitting ($K$ too low) obscures real structure, while overfitting ($K$ too high) interprets noise as structure.
The Specific Issue: The most widely used heuristic for selecting $K$ is Evanno's $\Delta K$ method. This method identifies an "elbow" in the log-likelihood curve by calculating the second-order rate of change of the log-likelihood as $K$ increases.
The Phenomenon: Practitioners have empirically observed that $\Delta K$ frequently selects $K=2$, even when more complex substructure exists (e.g., $K=3$). This bias has been documented in large-scale surveys of genetic studies, potentially leading to erroneous conclusions in conservation biology and evolutionary history.
The Gap: While the tendency of $\Delta K$ to underfit is known empirically, there has been no rigorous theoretical proof explaining why this occurs, particularly why it persists even with infinite data (inconsistency).

2. Methodology

The authors provide a theoretical framework to prove the inconsistency of the $\Delta K$ method under specific conditions.

Model Framework:
- The study focuses on the Maximum Likelihood Estimation (MLE) version of the admixture model (implemented in ADMIXTURE).
- Data Generation: They assume a "true" scenario with $K_0 = 3$ populations. Individuals are partitioned into three distinct groups ($N_1, N_2, N_3$) with pure ancestry ($Q_{nk} \in \{0, 1\}$).
- Assumptions:
  1. Boundedness: Allele frequencies are bounded away from 0 and 1 to prevent log-likelihood divergence.
  2. Tree-like Structure: Populations 2 and 3 are more closely related to each other than either is to Population 1.

The $\Delta K$ Criterion (Adapted):
- The authors define a simplified, un-normalized version of Evanno's statistic to facilitate theoretical analysis:
  $\hat{\Delta}(K) = |2\hat{L}(K) - \hat{L}(K-1) - \hat{L}(K+1)|$
  where $\hat{L}(K)$ is the maximum log-likelihood for a given $K$.
- The selected $\hat{K}$ is the value that maximizes $\hat{\Delta}(K)$.

Theoretical Approach:
- The authors analyze the asymptotic behavior of the log-likelihoods as the number of individuals ($N$) and SNPs ($L$) approach infinity.
- They utilize Kullback-Leibler (KL) divergence to measure the information loss when merging populations.
- They derive a condition based on population genetic divergence (specifically $F_{ST}$ parameters) under a nested Balding-Nichols model, which simulates realistic hierarchical population drift.
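
The simplified $\hat{\Delta}(K)$ criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `delta_k` and `select_k` are hypothetical, and the log-likelihood values in `example` are made-up numbers chosen only to show the shape of the problem (a large gain from $K=1$ to $K=2$, a modest but real gain from $K=2$ to $K=3$, and almost none afterwards).

```python
def delta_k(loglik):
    """Simplified, un-normalized Delta-K for every K whose neighbors exist.

    loglik maps K -> maximized log-likelihood L_hat(K).
    Returns {K: |2*L_hat(K) - L_hat(K-1) - L_hat(K+1)|}.
    """
    return {
        k: abs(2 * loglik[k] - loglik[k - 1] - loglik[k + 1])
        for k in sorted(loglik)
        if k - 1 in loglik and k + 1 in loglik
    }


def select_k(loglik):
    """Return the K that maximizes the simplified Delta-K statistic."""
    scores = delta_k(loglik)
    return max(scores, key=scores.get)


# Hypothetical log-likelihoods (illustrative numbers, not from the paper).
# Gains: 1->2 is 300, 2->3 is 50, 3->4 is only 2.
example = {1: -1000.0, 2: -700.0, 3: -650.0, 4: -648.0}

print(delta_k(example))   # scores for K=2 and K=3
print(select_k(example))  # K chosen by the rule
```

In this toy input the gain from adding a third population (50) is far larger than the gain from adding a fourth (2), yet the rule still picks $K=2$ because the second difference at $K=2$ is dominated by the large first jump.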

3. Key Contributions

Proof of Inconsistency: The paper provides the first rigorous mathematical proof that the $\Delta K$ method is inconsistent. It demonstrates that under specific conditions, the method will select $\hat{K}=2$ with probability approaching 1, even as $N, L \to \infty$, despite the true number of populations being $K_0=3$.
Explicit Sufficient Conditions: The authors derive a precise inequality involving KL divergences ($D_{31}$ and $D_{32}$) that determines when the method fails.
Connection to $F_{ST}$: They translate the abstract information-theoretic condition into a practical population genetic context using the Balding-Nichols model, linking the failure of $\Delta K$ to specific ratios of drift parameters ($F_{root}$ and $F_{sub}$).

4. Key Results

Theoretical Result (Theorem 1)

The method fails (selects $K=2$ ) if the information loss from merging the two closely related populations (2 and 3) is small relative to the total heterogeneity of the three populations.

Let $D_{31}$ be the total heterogeneity (KL divergence of all three populations from their mean).
Let $D_{32}$ be the divergence between the cluster of populations {2,3} and their mean.
Condition for Failure: If $D_{32} < \frac{1}{3} D_{31}$, then $\hat{K} \to 2$ as $N, L \to \infty$.
Interpretation: If populations 2 and 3 are sufficiently similar to each other compared to how different they are from population 1, the "elbow" in the likelihood curve appears at $K=2$ .
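
A heuristic way to see where the factor $\frac{1}{3}$ might come from is sketched below. This is a reading of the definitions in this summary, not the authors' proof. It assumes (as labeled assumptions) that the normalized log-likelihood $\ell(K)$, a symbol introduced here only for illustration, has a limit; that the fit lost by using $K=1$ or $K=2$ instead of $K=3$ equals $D_{31}$ and $D_{32}$ respectively; and that adding populations beyond the true $K_0=3$ yields no further gain at leading order.

```latex
\begin{align*}
\ell(K) &:= \lim_{N,L\to\infty} \frac{\hat{L}(K)}{NL}, \qquad
  \ell(3)-\ell(1) = D_{31}, \quad
  \ell(3)-\ell(2) = D_{32}, \quad
  \ell(K) = \ell(3) \ \text{for } K \ge 3 \\
\frac{\hat{\Delta}(2)}{NL} &\to \left| 2\ell(2) - \ell(1) - \ell(3) \right|
  = \left| D_{31} - 2 D_{32} \right| \\
\frac{\hat{\Delta}(3)}{NL} &\to \left| 2\ell(3) - \ell(2) - \ell(4) \right|
  = D_{32} \\
D_{32} < \tfrac{1}{3} D_{31} &\implies
  D_{31} - 2 D_{32} > \tfrac{1}{3} D_{31} > D_{32}
\end{align*}
```

Under these assumptions the limiting score at $K=2$ exceeds the one at $K=3$ exactly when $D_{32} < \frac{1}{3} D_{31}$, and the scores for $K \ge 4$ vanish at leading order, so the maximizer is $K=2$.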

Application to Realistic Models (Theorem 2)

Using the nested Balding-Nichols model (Figure 1), where $F_{root}$ represents the drift separating Population 1 from the {2,3} ancestor, and $F_{sub}$ represents the drift separating 2 from 3:

Threshold Condition: The method selects $K=2$ if the drift parameters are sufficiently small and satisfy:
$F_{root} / F_{sub} > 3/4$
Simulation Verification: The authors simulated 99 parameter settings with $N = 150$ individuals and $L = 2000$ SNPs.
- Result: The simulations confirmed a sharp phase transition. When the ratio $F_{root}/F_{sub}$ exceeded 0.75, the $\Delta K$ method consistently selected $K=2$. When the ratio was lower (meaning populations 2 and 3 were relatively more distinct), it correctly selected $K=3$.
- Relevance: These $F_{ST}$ values are typical of human populations, explaining why the phenomenon is frequently observed in real-world data.
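
To make the simulated setting more concrete, here is a rough sketch of how data from a nested Balding-Nichols model could be generated using only the Python standard library. It is not the authors' simulation code. Several details are assumptions made for illustration: the tree layout (Population 1 and the {2,3} ancestor both drift from a root with $F_{root}$, then Populations 2 and 3 drift from that ancestor with $F_{sub}$), the Uniform(0.1, 0.9) root frequencies, equal group sizes, the clipping bound `eps`, and all function names.

```python
import random


def balding_nichols(p, f, rng, eps=0.01):
    """Draw a descendant allele frequency from Beta(p(1-f)/f, (1-p)(1-f)/f).

    The result is clipped to [eps, 1-eps] so frequencies stay bounded
    away from 0 and 1 (eps is an illustrative choice).
    """
    scale = (1.0 - f) / f
    q = rng.betavariate(p * scale, (1.0 - p) * scale)
    return min(max(q, eps), 1.0 - eps)


def simulate(n_per_pop, n_snps, f_root, f_sub, seed=0):
    """Return (freqs, genotypes) for three populations with pure ancestry."""
    rng = random.Random(seed)
    freqs = [[], [], []]
    for _ in range(n_snps):
        p0 = rng.uniform(0.1, 0.9)              # assumed root frequency
        p1 = balding_nichols(p0, f_root, rng)   # Population 1
        anc = balding_nichols(p0, f_root, rng)  # ancestor of {2, 3}
        p2 = balding_nichols(anc, f_sub, rng)   # Population 2
        p3 = balding_nichols(anc, f_sub, rng)   # Population 3
        for k, p in enumerate((p1, p2, p3)):
            freqs[k].append(p)
    # Diploid genotypes: each entry counts alleles (0, 1, or 2) at one SNP.
    genotypes = [
        [sum(rng.random() < p for _ in range(2)) for p in freqs[k]]
        for k in range(3)
        for _ in range(n_per_pop)
    ]
    return freqs, genotypes


def mean_sq_diff(a, b):
    """Mean squared allele-frequency difference between two populations."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


freqs, genotypes = simulate(n_per_pop=5, n_snps=2000, f_root=0.1, f_sub=0.01)
print(mean_sq_diff(freqs[0], freqs[1]), mean_sq_diff(freqs[1], freqs[2]))
```

The resulting genotype matrix (individuals by SNPs) is the kind of input one would then pass to ADMIXTURE for each candidate $K$; the fitting step itself is not shown here.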

5. Significance and Implications

Theoretical Explanation: This work moves the discussion of $\Delta K$ bias from empirical observation to a mathematical proof under stated conditions, explaining why the method underfits in hierarchical structures with low divergence.
Practical Guidance:
- Researchers should not rely solely on a single $\Delta K$ peak, especially when populations are closely related (low $F_{ST}$).
- The results suggest that when $F_{root}/F_{sub}$ is high (a deep split between one group and a pair of similar groups), $\Delta K$ will likely collapse the similar pair into a single cluster.
Broader Impact: While the proof focuses on MLE/ADMIXTURE, the authors conjecture that any model selection technique relying on comparing log-likelihoods across $K$ may suffer from similar underfitting issues in closely related population scenarios.
Recommendation: The authors advocate for reporting results across a range of $K$ values and integrating biological context rather than relying on a single automated metric.

In summary, the paper proves that Evanno's $\Delta K$ is not a universally consistent estimator for the number of populations. It mathematically characterizes the "K=2 phenomenon" as a consequence of the method's inability to distinguish between a three-population hierarchy and a two-population split when the sub-structure is sufficiently subtle relative to the primary divergence.