Evaluating genome assemblies with HMM-Flagger


This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to build a massive, intricate puzzle of a human genome. You have millions of tiny puzzle pieces (DNA reads) and a super-fast robot (an assembler) that tries to snap them together. Sometimes the robot does a great job, but other times it gets confused by tricky, repetitive patterns in the puzzle. It might build the same section twice (a false duplication), squash several near-identical sections into one (a collapse), or join pieces that do not belong together (an erroneous block).

The paper introduces a new tool called HMM-Flagger, which acts like a super-smart quality control inspector for these genome puzzles. Here is how it works, explained in everyday terms:

1. The Problem: The "Crowded Room" Analogy

To check if the puzzle is built correctly, the researchers use a clever trick. Imagine you have a photo of the finished puzzle (the assembly). Now, imagine you take all the original puzzle pieces (the DNA reads) and try to drop them back onto the photo.

Normal Area: If a section of the puzzle is correct, the pieces will land evenly. The "crowd" of pieces will be just the right density.
False Duplication: If the robot built the same section twice by mistake, the crowd of pieces landing on each copy will be too thin. Why? Because the pieces cannot tell which of the two identical copies they belong to, so they split between them, leaving each copy looking sparse.
Collapsed Block: If the robot merged several near-identical sections into one, the crowd of pieces will be too thick. All the pieces that belong to the merged-away copies pile up on this one spot.
Erroneous Block: If the pieces are glued in a way that makes no sense, few or no pieces will land there. The area will be nearly or completely empty.

2. The Solution: HMM-Flagger (The Smart Inspector)

Old tools were like inspectors who looked at the puzzle in giant, blurry chunks (5 megabases at a time). They could see big problems but missed small ones, and they judged each chunk on its own, ignoring what the neighboring chunks looked like.

HMM-Flagger is the upgrade. It uses a mathematical brain called a Hidden Markov Model (HMM) combined with a "memory" system (a Gaussian AutoRegressive Process, or GARP).

The "Memory" Part: Imagine you are walking through a forest. If the ground is muddy for one step, it's likely muddy for the next step too. HMM-Flagger understands that DNA coverage doesn't jump randomly; it flows. If the "crowd" of pieces is thin in one spot, it expects the next spot to be thin too, unless there's a clear reason for a change. This helps it ignore small, random glitches and focus on real structural errors.
The "No Reference" Superpower: Most inspectors need a "perfect" master copy of the puzzle to compare against. But for humans, we don't have a perfect master copy for everyone. HMM-Flagger is special because it doesn't need a master copy. It just looks at the crowd density of the pieces and says, "Hey, this area looks weird compared to the rest of the puzzle," without needing to know what the "correct" version looks like.

3. What Did They Find?

The researchers tested this tool on some of the best human genome puzzles ever made (from the Human Pangenome Reference Consortium).

It Got Better: They compared "Release 1" of these puzzles to "Release 2." HMM-Flagger showed that Release 2 was much cleaner, with far fewer errors, which supports the idea that the newer assembly methods really are more accurate.
It Found Hidden Treasures: The tool was used to check a specific, very tricky area of the genome called NOTCH2NL (a gene family linked to human brain growth). The analysis identified three brand-new arrangements of this region in different people, configurations that had not been described before.
It Caught Mistakes: It spotted places where the robot assembler had accidentally duplicated a gene or collapsed a section, preventing scientists from making false conclusions about human genetics.

4. Why Does This Matter?

Think of genome assemblies as the "instruction manuals" for building a human. If the manual has typos or missing pages, doctors might think a patient has a disease when they don't, or miss a real disease.

HMM-Flagger is like a spell-checker for these instruction manuals. It ensures that when scientists study human genetics, they are looking at the truth, not a glitchy version of the data. It helps us build a more accurate "Pangenome" (a library of all human genetic variations), which is crucial for future medical breakthroughs.

In short: HMM-Flagger is a smart, reference-free tool that counts how many DNA pieces land in each spot to find where the genome puzzle was assembled incorrectly, helping us build a more accurate map of human DNA.

1. Problem Statement

The rapid advancement of long-read sequencing technologies (PacBio HiFi and Oxford Nanopore) has enabled the creation of high-quality, haplotype-resolved, and telomere-to-telomere (T2T) genome assemblies. However, assembling highly repetitive regions (e.g., centromeres, segmental duplications, and satellite arrays) remains challenging. These regions are prone to structural errors such as:

Collapsed blocks: Where multiple copies of a sequence are merged into one, leading to under-representation.
False duplications: Where a single sequence is erroneously represented as multiple copies.
Erroneous blocks: Regions with severe base-level errors or misjoins.

Existing validation tools have significant limitations:

Reference-based methods (e.g., GQC) conflate true biological variation with assembly errors unless a "truth" assembly from the same sample exists (which is rare).
k-mer based methods (e.g., Merqury, Yak) struggle in repetitive regions due to the short length of k-mers.
Previous coverage-based tools (e.g., Flagger, NucFlag) often rely on fixed thresholds or treat genomic windows independently, failing to account for local coverage correlations or varying sequencing depths. They also lack the resolution to detect short misassemblies effectively.

2. Methodology: HMM-Flagger

The authors introduce HMM-Flagger, a reference-free tool that detects structural errors by analyzing read coverage patterns mapped back to the assembly.
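To make "read coverage patterns" concrete, here is a minimal sketch of how per-window mean depth could be computed from read alignments. It is not the tool's implementation: the function name `window_coverage`, the plain `(start, end)` tuples standing in for BAM records, and the per-base difference array are all assumptions chosen to keep the example small.

```python
def window_coverage(alignments, contig_length, window_size):
    """Mean read depth per fixed-size window.

    alignments: iterable of half-open (start, end) intervals on one contig.
    This is a toy stand-in for parsing a BAM file.
    """
    # Difference array: +1 where a read starts, -1 where it ends.
    diff = [0] * (contig_length + 1)
    for start, end in alignments:
        start = max(0, min(start, contig_length))
        end = max(0, min(end, contig_length))
        diff[start] += 1
        diff[end] -= 1

    means = []
    depth = 0          # running depth at the current base
    window_total = 0   # summed depth over the current window
    window_bases = 0   # bases seen in the current window
    for pos in range(contig_length):
        depth += diff[pos]
        window_total += depth
        window_bases += 1
        if window_bases == window_size or pos == contig_length - 1:
            means.append(window_total / window_bases)
            window_total = 0
            window_bases = 0
    return means


# Three toy reads on a 20-base contig, summarized in 10-base windows.
reads = [(0, 20), (0, 10), (5, 15)]
print(window_coverage(reads, contig_length=20, window_size=10))
```

A per-base array is only practical for small examples; a real tool working on gigabase-scale assemblies would need a more compact representation.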

Core Algorithm: Hidden Markov Model (HMM) with GARP

Input: Read mappings (BAM/SAM) of the same sample to the assembly.
State Space: The genome is divided into fixed windows (tuned per platform: 16kb for HiFi/ONT-R10, 8kb for ONT-R9). The HMM classifies each window into one of four states:
1. Haploid (Hap): Correctly assembled (baseline coverage).
2. Collapsed (Col): Coverage significantly higher than baseline (indicating merged copies).
3. Duplicated (Dup): Coverage significantly lower than baseline (indicating false splitting).
4. Erroneous (Err): Extremely low or absent coverage.
Emission Densities:
- Hap, Col, and Dup states are modeled using Gaussian Mixture Models (GMM).
- The Err state is modeled using a truncated exponential distribution.
- To handle overdispersion in real data, the Gaussian components have decoupled mean and variance, unlike a Poisson model, where the variance is forced to equal the mean.
Gaussian AutoRegressive Process (GARP):
- Standard HMMs assume observations are conditionally independent given the state. HMM-Flagger augments the HMM with GARP to account for the fact that reads often span multiple windows, creating local correlations in coverage.
- The mean of the emission distribution for a window $t$ is a linear combination of the previous observation ($x_{t-1}$) and a constant term, controlled by a hyperparameter matrix ($\alpha_{ij}$).
Constraints & Filtering:
- MAPQ Constraints: Transitions to "Dup" states are restricted to regions with low mapping quality (ambiguous mappings), while "Col" transitions require high mapping quality.
- Contig End Correction: Adjusts coverage expectations near contig ends where mappers often discard short alignments.
- Bias Correction: Allows independent parameter estimation for known biased regions (e.g., centromeric satellites) to prevent false positives caused by platform-specific coverage biases.
- Self-Homology Filtering: Uses self-mappings of assembly contigs to validate predictions. If a "collapsed" region has no homologous mapping elsewhere, or a "duplication" has no redundant mapping, the prediction is filtered out to create a conservative call set.
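
The four-state model with an autoregressive emission mean can be illustrated with a small Viterbi decoder. This is a simplified sketch under stated assumptions, not the authors' implementation: all numeric values (`BASELINE`, `SD`, `ERR_RATE`, `ALPHA`, `STAY`) are made up for illustration, a single weight `ALPHA` replaces the matrix $\alpha_{ij}$ and is applied only when the state does not change, single Gaussians replace the mixtures, a plain exponential replaces the truncated one, and the MAPQ constraints and filtering steps are omitted.

```python
import math

STATES = ["Err", "Dup", "Hap", "Col"]
BASELINE = 30.0  # assumed haploid coverage (illustrative value)
MEAN = {"Dup": 0.5 * BASELINE, "Hap": BASELINE, "Col": 2.0 * BASELINE}
SD = {"Dup": 4.0, "Hap": 5.0, "Col": 8.0}  # variance set independently of the mean
ERR_RATE = 1.0  # rate of the exponential used for the Err state (assumed)
ALPHA = 0.3     # single autoregressive weight (assumed; the paper uses alpha_ij)
STAY = 0.9      # probability of remaining in the same state (assumed)


def log_gauss(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_emission(prev_state, state, x, x_prev):
    """Log density of coverage x in `state`, given the previous window."""
    if state == "Err":
        # Plain exponential for simplicity; the source describes a truncated one.
        return math.log(ERR_RATE) - ERR_RATE * x
    if prev_state == state and x_prev is not None:
        # Autoregressive mean: previous observation blended with a constant term.
        mean = ALPHA * x_prev + (1 - ALPHA) * MEAN[state]
    else:
        mean = MEAN[state]
    return log_gauss(x, mean, SD[state])


def log_transition(prev_state, state):
    if prev_state == state:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(coverage):
    """Most likely state path for a list of per-window coverage values."""
    score = {s: -math.log(len(STATES)) + log_emission(None, s, coverage[0], None)
             for s in STATES}
    backpointers = []
    for t in range(1, len(coverage)):
        x, x_prev = coverage[t], coverage[t - 1]
        new_score, pointer = {}, {}
        for j in STATES:
            # The emission depends on the previous state, so it is scored
            # together with the transition.
            candidates = {
                i: score[i] + log_transition(i, j) + log_emission(i, j, x, x_prev)
                for i in STATES
            }
            best = max(candidates, key=candidates.get)
            new_score[j], pointer[j] = candidates[best], best
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    state = max(score, key=score.get)
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


coverage = [30, 31, 29, 30, 62, 58, 61, 60, 30, 29, 0, 0, 0, 30, 31]
labels = viterbi(coverage)
print(list(zip(coverage, labels)))
```

On this toy track the decoder labels the run near 60x as Col, the run of zeros as Err, and the rest as Hap. Because the emission mean leans toward the previous observation, a window that deviates slightly from its neighbors is penalized less than it would be under independent windows, which is the intuition behind modeling local coverage correlation.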

Training and Optimization

Synthetic Data Generation: The authors developed a tool called Falsifier to introduce known misassemblies (deletions, insertions, SNPs) into the high-quality HG002-T2T-v1.1 reference.
Hyperparameter Tuning: A Bayesian Optimization approach (Efficient Global Optimization) was used to tune the GARP hyperparameters. The objective function maximized the similarity between HMM-Flagger's predictions and the ground-truth coordinates of the synthetic errors.
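
One plausible way to score "similarity" between predicted and ground-truth coordinates is a base-level F1 over interval overlap. The sketch below assumes that definition; the source does not state the exact objective, and the function names (`merge`, `overlap_bases`, `base_level_f1`) are hypothetical. An optimizer such as the Bayesian approach mentioned above would then search for hyperparameters that maximize this score.

```python
def merge(intervals):
    """Sort half-open (start, end) intervals and merge overlapping ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_bases(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def base_level_f1(predicted, truth):
    """Harmonic mean of base-level precision and recall."""
    true_positive = overlap_bases(predicted, truth)
    if true_positive == 0:
        return 0.0
    precision = true_positive / sum(end - start for start, end in merge(predicted))
    recall = true_positive / sum(end - start for start, end in merge(truth))
    return 2 * precision * recall / (precision + recall)


# A 100-base synthetic error; the prediction is shifted by 50 bases.
print(base_level_f1(predicted=[(150, 250)], truth=[(100, 200)]))
```

In the example, half of the predicted bases are correct and half of the true bases are recovered, so precision, recall, and F1 all equal 0.5.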

3. Key Contributions

Novel Algorithm: The first assembly evaluation tool to combine a Hidden Markov Model with a Gaussian AutoRegressive Process (GARP) to model local coverage dependencies, significantly improving detection accuracy in repetitive regions.
Reference-Free Validation: Provides a robust method for detecting structural errors without requiring a high-quality reference genome from the same individual.
Platform Agnostic: Successfully validated on both PacBio HiFi and Oxford Nanopore (R9 and R10) data, with specific tuning for each platform's error profiles and coverage biases.
Comprehensive Filtering: Introduces a multi-stage filtering pipeline (MAPQ constraints, contig-end correction, and self-homology checks) to reduce false positives, particularly in complex satellite regions.

4. Results

Benchmarking on Synthetic Data:
- On assemblies with a 3.32% misassembly rate, HMM-Flagger achieved F1 scores of 78.4% (HiFi) and 60.4% (ONT-R10).
- It significantly outperformed previous tools: Flagger (58.9% F1 on HiFi) and NucFlag (57.5% F1 on HiFi).
- The tool demonstrated high robustness to coverage depth, with only a marginal drop in F1 score when down-sampled from 40x to 20x.
- Detection sensitivity increased with misassembly size; it detected 96.58% of 320kb errors but had lower recall (72.92%) for 40kb errors. False duplications were the hardest category to detect.
Evaluation of HG002 Assemblies:
- Applied to six recent HG002 assemblies, HMM-Flagger identified large misassemblies, including a 1.5Mb false duplication in a PECAT assembly and a 150kb collapsed block in a T2T-v0.7 assembly.
- It revealed that ONT-based predictions were highly sensitive but required the conservative filtering step to achieve high precision.
Human Pangenome Reference Consortium (HPRC) Analysis:
- Release 1 vs. Release 2: HMM-Flagger quantified the improvement in HPRC assemblies. The global error rate dropped from 0.94% (Release 1) to 0.38% (Release 2) using HiFi data. False duplications showed the most significant improvement (0.62% $\to$ 0.22%).
- NOTCH2NL Validation: The tool was used to validate the complex NOTCH2NL gene family (critical for brain expansion).
  - Confirmed accurate resolution in 98% of Release 2 assemblies (up from 73% in Release 1).
  - Identified three novel structural configurations (H12, H13, H14) involving gene conversions and duplications.
  - Successfully flagged false duplications in specific samples (e.g., HG03521, NA20870) that would have otherwise been misclassified as valid structural variants.
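
The Release 1 and Release 2 error rates reported above can also be read as relative reductions. The short calculation below uses only the four percentages quoted in this summary; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Percentages quoted in the HPRC comparison (HiFi data).
overall = relative_reduction(0.94, 0.38)
false_dup = relative_reduction(0.62, 0.22)

print(f"overall error rate: {overall:.1%} lower")
print(f"false duplications: {false_dup:.1%} lower")
```

By this measure the overall flagged fraction fell by roughly 60% and false duplications by roughly 65%, so false duplications account for most of the absolute improvement (0.40 of the 0.56 percentage points).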

5. Significance

HMM-Flagger advances genome assembly validation by moving beyond fixed thresholds and independent window analysis. It provides a statistical framework for detecting structural errors in the most difficult-to-assemble regions of the human genome.

Its application to the HPRC shows that it can track progress in assembly methods, supporting the conclusion that Release 2 assemblies are more accurate than Release 1. Its ability to validate complex loci like NOTCH2NL also matters for clinical genomics, where structural variants associated with human traits and diseases must be distinguished from artifacts of assembly errors. This makes the tool well suited to the ongoing effort to build a more complete and accurate human pangenome reference.