Imagine you are a detective trying to solve a crime. You have two pieces of evidence:
- The "Trace": A tiny, muddy, half-eaten cookie found at the crime scene. It's old, crumbly, and hard to read.
- The "Reference": A fresh, perfect cookie from a suspect's lunchbox. It's crisp, clear, and easy to read.
Your goal is to determine: Did the suspect eat the cookie at the crime scene, or is the cookie from someone else entirely?
In the world of forensic science, these "cookies" are DNA samples. Usually, scientists look for specific patterns called STRs (short tandem repeats, which act like unique barcodes). But sometimes, the "trace" cookie is so old and degraded (like a telogen hair) that the barcode is too broken to read.
This is where Shotgun DNA Sequencing comes in. Instead of reading the whole barcode, scientists look at thousands of tiny, scattered letters (SNPs, or single-nucleotide polymorphisms) across the entire genome. It's like trying to identify a person by reading random words from their diary instead of their full name.
However, there's a catch: Reading errors.
- The Reference (fresh cookie) is read with a high-quality machine. It rarely makes mistakes.
- The Trace (muddy cookie) is read with a struggling machine. It makes many mistakes because the DNA is damaged.
The Problem: The Old Model
Previously, scientists used a mathematical model (called wgsLR) to compare these two samples. But this old model had a blind spot: it assumed both samples were read with the same level of perfection. It thought, "If the trace sample has a typo, it must be because the suspect is different, not because the machine messed up."
This was dangerous. If the trace sample was actually very messy (high error rate), the model might blame the suspect for a mistake the machine made, or vice versa.
The Solution: The New Paper
Mikkel Meyer Andersen and his team updated the model to be smarter. They introduced three major improvements, which we can explain with simple analogies:
1. The "Two-Speed Camera" Analogy (Asymmetric Errors)
Imagine taking a photo of a suspect.
- Photo A (Reference): Taken with a professional 8K camera in a studio. It's crystal clear.
- Photo B (Trace): Taken with a blurry, old phone camera in the rain. It's full of static and noise.
The new model realizes that Photo B is naturally blurrier than Photo A. It assigns a specific "blur score" (error probability) to the trace and a different, lower "blur score" to the reference.
- Why it matters: If the blurry photo has a smudge that looks like a different person, the model now says, "Ah, that's probably just the rain on the lens (an error), not a different person." This prevents false accusations.
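To make the asymmetry concrete, here is a minimal sketch of a per-SNP likelihood ratio in a deliberately simplified haploid model (one read per sample, one allele frequency, made-up error rates). It is not the paper's actual wgsLR implementation, but it shows how giving the trace its own "blur score" changes the verdict on a mismatch:

```python
def read_prob(read, allele, w):
    """Probability of observing `read` given the true `allele`, with error rate w."""
    return 1 - w if read == allele else w

def snp_lr(t, r, p, w_trace, w_ref):
    """Likelihood ratio for one biallelic SNP in a toy haploid model.
    t, r: observed alleles (0/1) in the trace and reference reads.
    p: population frequency of allele 1.
    w_trace, w_ref: separate (asymmetric) error rates for the two samples."""
    freqs = ((0, 1 - p), (1, p))
    # Same donor: sum over the unknown true allele behind both reads.
    same = sum(f * read_prob(t, x, w_trace) * read_prob(r, x, w_ref)
               for x, f in freqs)
    # Different donors: the two reads are independent.
    p_t = sum(f * read_prob(t, y, w_trace) for y, f in freqs)
    p_r = sum(f * read_prob(r, x, w_ref) for x, f in freqs)
    return same / (p_t * p_r)

print(snp_lr(1, 1, 0.5, 0.001, 0.001))  # matching reads: LR ≈ 2.0
print(snp_lr(0, 1, 0.5, 0.001, 0.001))  # mismatch, clean trace: LR ≈ 0.004
print(snp_lr(0, 1, 0.5, 0.1,   0.001))  # mismatch, blurry trace: LR ≈ 0.20
```

Note how the same mismatch is roughly fifty times less damning once the model knows the trace is blurry: it blames the rain on the lens, not a different person.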
2. The "Weather Forecast" Analogy (Overdispersion)
Sometimes, the "blur" isn't evenly spread. Maybe the top half of the photo is clear, but the bottom half is a mess.
- The researchers asked: "What if the error rate changes from spot to spot?"
- The Result: They found the model is remarkably robust. It's like a weather forecast that can still predict rain accurately even if the clouds are patchy. As long as the average error rate is known, the model still works well, even if some parts of the DNA are messier than others.
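A rough way to see why, in a toy haploid model (the allele frequency, error rates, and Beta distribution here are invented for illustration, not taken from the paper): the per-site likelihood is linear in the error rate, so averaging over patchy, site-varying error rates lands almost exactly where the single average rate does:

```python
import random

random.seed(42)
P, W_REF = 0.3, 0.001  # toy allele frequency and reference error rate

def read_prob(read, allele, w):
    return 1 - w if read == allele else w

def snp_lik_same_donor(t, r, w):
    """Per-SNP likelihood under 'same donor' for trace error rate w.
    Note it is linear in w, which is the key to the robustness."""
    return sum(f * read_prob(t, x, w) * read_prob(r, x, W_REF)
               for x, f in ((0, 1 - P), (1, P)))

# Patchy errors: draw a different rate per site from a Beta with mean 0.05...
draws = [random.betavariate(2, 38) for _ in range(100_000)]  # mean 2/(2+38)
varying = sum(snp_lik_same_donor(0, 1, w) for w in draws) / len(draws)
# ...versus simply plugging in the constant average rate.
constant = snp_lik_same_donor(0, 1, 0.05)
print(abs(varying - constant))  # tiny: only the average rate matters here
```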
3. The "Gambler's Choice" Analogy (Unknown Errors)
Sometimes, you don't even know how bad the "muddy cookie" is. You have no idea if the error rate is 1% or 10%. How do you calculate the odds?
The paper offers two ways to handle this "unknown":
Method A: The "Best Guess" (Profile Likelihood)
Imagine a gambler who gets to pick the most favorable "error rate" for their side of the bet.
- If the evidence looks like the suspect, the model picks the error rate that makes that look most likely.
- If the evidence looks like a stranger, it picks the error rate that makes that look most likely.
- The Catch: This can sometimes be too optimistic. It might pick an error rate that is too high, making it seem like "any difference is just a mistake," which weakens the evidence against a real criminal.
Method B: The "Weighted Average" (Bayesian Integration)
Imagine a panel of experts. Each expert has a different guess about how bad the cookie is. You ask all of them, calculate the result for each guess, and then take the average.
- This is safer and more balanced. It doesn't rely on one "lucky" guess.
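The two methods can be sketched side by side on a toy haploid likelihood (the allele frequency, data, grid of candidate error rates, and flat prior are all invented for illustration; the paper works with the full wgsLR model):

```python
P = 0.3        # toy population frequency of allele 1
W_REF = 0.001  # reference error rate, assumed known and small

def read_prob(read, allele, w):
    return 1 - w if read == allele else w

def lik_same(data, w):
    """Likelihood that trace and reference share one donor, given trace error w."""
    total = 1.0
    for t, r in data:
        total *= sum(f * read_prob(t, x, w) * read_prob(r, x, W_REF)
                     for x, f in ((0, 1 - P), (1, P)))
    return total

def lik_diff(data, w):
    """Likelihood under two independent donors."""
    total = 1.0
    for t, r in data:
        p_t = sum(f * read_prob(t, y, w) for y, f in ((0, 1 - P), (1, P)))
        p_r = sum(f * read_prob(r, x, W_REF) for x, f in ((0, 1 - P), (1, P)))
        total *= p_t * p_r
    return total

GRID = [i / 100 for i in range(1, 50)]  # candidate trace error rates

def profile_lr(data):
    """Method A: maximise over w separately in numerator and denominator."""
    return max(lik_same(data, w) for w in GRID) / max(lik_diff(data, w) for w in GRID)

def bayes_lr(data):
    """Method B: average over a flat prior on w (the equal weights cancel)."""
    return sum(lik_same(data, w) for w in GRID) / sum(lik_diff(data, w) for w in GRID)

data = [(1, 1)] * 20 + [(0, 1)] * 2  # 20 matching SNPs, 2 mismatches
print(profile_lr(data), bayes_lr(data))  # both favour "same donor" here
```

The gambler (Method A) hands each hypothesis its own best error rate, while the expert panel (Method B) lets every candidate rate vote; on the same data they generally give different, though here similarly oriented, answers.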
The Big Discovery: "Better Safe Than Sorry"
The most important finding of the paper is a rule of thumb for detectives:
It is safer to assume the evidence is cleaner than it actually is, rather than assuming it is dirtier.
- If you assume the trace is too clean (low error rate): You might think a difference is a real mismatch. This is conservative; you might let a guilty person go, but you won't convict an innocent one.
- If you assume the trace is too dirty (high error rate): You might think a difference is just a "machine error." This is dangerous. You might say, "Oh, that mismatch is just a typo," and wrongly conclude the suspect is innocent when they are actually guilty.
The Recommendation: Since we usually don't know the exact error rate of the old, messy trace sample, the paper suggests a practical trick: Just use the error rate of the clean reference sample.
Since the reference sample is usually very clean, assuming the messy trace is that clean is an "underestimate" of the errors. This is the safe, conservative choice that protects against false convictions.
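A toy numerical check of that intuition, in the same kind of simplified haploid model with invented numbers: a trace that really came from the suspect but carries a few read errors still gives a likelihood ratio above 1 either way, but pretending the trace is as clean as the reference makes the LR smaller, i.e. it understates rather than overstates the strength of the match:

```python
P, W_REF = 0.3, 0.001  # toy allele frequency and (clean) reference error rate

def read_prob(read, allele, w):
    return 1 - w if read == allele else w

def snp_lr(t, r, w_trace):
    """Per-SNP likelihood ratio, same-donor vs different-donor."""
    freqs = ((0, 1 - P), (1, P))
    same = sum(f * read_prob(t, x, w_trace) * read_prob(r, x, W_REF)
               for x, f in freqs)
    p_t = sum(f * read_prob(t, y, w_trace) for y, f in freqs)
    p_r = sum(f * read_prob(r, x, W_REF) for x, f in freqs)
    return same / (p_t * p_r)

def total_lr(data, w_trace):
    lr = 1.0
    for t, r in data:
        lr *= snp_lr(t, r, w_trace)
    return lr

# The trace really is the suspect's, but 2 of 22 reads flipped (true error ≈ 9%).
data = [(1, 1)] * 20 + [(0, 1)] * 2
lr_honest = total_lr(data, 0.09)   # model the mess at its true rate
lr_safe   = total_lr(data, W_REF)  # assume the trace is as clean as the reference
print(lr_safe < lr_honest)         # True: the "clean" assumption is conservative
```

Underestimating the error rate shrinks the LR for a true donor, which errs on the side of the defendant, exactly the "better safe than sorry" direction the paper recommends.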
Summary
This paper gives forensic scientists a new, upgraded calculator.
- It understands that crime scene DNA is often messier than suspect DNA.
- It works even if the messiness is uneven.
- It provides safe ways to handle cases where we don't know exactly how messy the DNA is.
The result? A more reliable way to use shotgun DNA sequencing to identify people, ensuring that the math doesn't accidentally blame the machine for a crime, or let a criminal off the hook because of a "typo."