A run-length-compressed skiplist data structure for dynamic GBWTs supports time and space efficient pangenome operations over syncmers

This paper introduces a dynamic, run-length-compressed skiplist data structure for the graph Burrows-Wheeler transform (GBWT) that enables time- and space-efficient pangenome operations on syncmer graphs, building a 5.8 GB lossless representation of 92 human genomes in under an hour and supporting rapid sequence matching.

Durbin, R.

Published 2026-03-29

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to navigate a massive, ever-expanding city that represents the genetic code of a species. Instead of just one straight road (a single reference genome), this city has millions of different routes, shortcuts, and detours because every person's DNA is slightly different. This collection of all possible routes is called a pangenome.

The problem is that this city is so huge and complex that traditional maps are too slow to use. If you try to find a specific address (a specific DNA sequence) in a standard map, you might get stuck in traffic or take hours to figure out where you are.

This paper introduces a new, super-fast way to build and navigate this genetic city. Here is the breakdown using simple analogies:

1. The Problem: The "Static" Map vs. The "Living" City

Previous methods for mapping these genetic cities were like building a giant, stone statue of the city. Once it was built, you couldn't change it easily. If a new road appeared (a new genetic variant), you had to tear the whole statue down and rebuild it. This was slow, clumsy, and didn't allow for dynamic updates.

2. The Solution: The "Skip-List" Elevator System

The author, Richard Durbin, introduces a new data structure called Rskip (a type of "Skip List").

  • The Analogy: Imagine a library with millions of books arranged in order.
    • The Old Way: To find a book, you have to walk down every single aisle, checking every book one by one. This takes forever.
    • The Skip-List Way: Imagine this library has a special system of elevators and express lanes.
      • On the ground floor, you have a long list of books.
      • On the second floor, there are "express" markers every 10 books.
      • On the third floor, there are markers every 100 books.
      • To find a book, you take the top elevator down to the floor where you are close to your target, then walk a short distance on that floor, then drop down to the next level, and so on.
    • The Result: Instead of walking 1 million steps, you might only take 20 steps to find your book. This is what the "Skip List" does for DNA data—it lets you jump over huge chunks of information to find exactly what you need instantly.
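The elevator analogy above can be sketched as a classic textbook skip list. This is a generic illustration, not the author's Rskip structure (which additionally handles run-length-compressed data); the class names, the maximum level, and the promotion probability `p` are all illustrative choices.

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        # forward[i] is the next node on "floor" i
        self.forward = [None] * (level + 1)

class SkipList:
    def __init__(self, max_level=4, p=0.5):
        self.max_level = max_level
        self.p = p
        self.head = Node(None, max_level)  # sentinel entrance
        self.level = 0

    def _random_level(self):
        # Flip coins: each node reaches a higher floor with probability p
        lvl = 0
        while random.random() < self.p and lvl < self.max_level:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * (self.max_level + 1)
        node = self.head
        # Walk from the top floor down, remembering the last node on each floor
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl + 1):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        # Ride each express floor as far as possible, then drop down a level
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```

Because each floor has roughly half as many markers as the one below, a search visits only O(log n) nodes in expectation rather than walking the whole ground floor.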

3. The "Compressed" City Blocks

DNA data is full of repetition. Just like a city might have 1,000 identical houses in a row, DNA has long stretches of repeated sequences.

  • The Analogy: Instead of listing "House A, House A, House A..." 1,000 times, the new system writes: "House A (x1000)."
  • This is called Run-Length Compression. The "Rskip" system is designed specifically to handle these compressed lists efficiently, allowing the computer to jump over the "1000 houses" instantly without counting them one by one.
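The "House A (x1000)" idea, and the trick of jumping over whole runs rather than counting houses one by one, can be sketched in a few lines. The function names here are illustrative, not the Rskip API.

```python
from itertools import groupby

def run_length_encode(seq):
    """Collapse consecutive repeats into (symbol, count) pairs,
    e.g. "AAAABBBC" -> [('A', 4), ('B', 3), ('C', 1)]."""
    return [(sym, sum(1 for _ in grp)) for sym, grp in groupby(seq)]

def select(runs, k):
    """Return the k-th symbol (0-based) of the original sequence,
    skipping whole runs instead of stepping one symbol at a time."""
    for sym, length in runs:
        if k < length:
            return sym
        k -= length  # leap over this entire run at once
    raise IndexError(k)
```

The cost of `select` depends on the number of runs, not the length of the sequence, which is exactly why run-length compression pays off on highly repetitive DNA.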

4. The "Syncmer" Landmarks

To navigate this city, you need landmarks. The author uses something called Syncmers.

  • The Analogy: Imagine you are navigating a city by looking for specific, unique street signs (like "The Big Red Barn" or "The Blue Fountain").
  • Syncmers are these unique street signs. They are small, specific snippets of DNA that act as reliable markers. Because they are chosen carefully, they appear frequently enough to guide you, but they are unique enough to tell you exactly where you are.
  • The system builds a map where the "streets" are made of these syncmer landmarks.
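One standard way to pick such landmarks is the closed-syncmer rule: keep a k-mer only if its lexicographically smallest s-mer sits at either end of the window. The sketch below assumes that rule with toy parameters; the paper's exact syncmer scheme and parameter choices may differ.

```python
def closed_syncmers(seq, k=5, s=2):
    """Return (position, kmer) pairs for closed syncmers: k-mers whose
    smallest s-mer occurs at the first or last position of the window.
    Toy parameters for illustration; real pipelines use larger k and s."""
    out = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        m = min(smers)
        if smers[0] == m or smers[-1] == m:
            out.append((i, kmer))
    return out
```

Because the decision depends only on the k-mer itself, overlapping reads of the same region select the same landmarks, which is what makes syncmers reliable "street signs."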

5. The Real-World Test: The Human Pangenome

The author tested this system on 92 full human genomes.

  • The Scale: This is a massive amount of data (about 280 billion letters of DNA).
  • The Speed:
    • Building the Map: It took only 52 minutes on a single computer processor to build this entire dynamic map. (Previous methods would take much longer or require massive supercomputers).
    • Searching: Once built, the system can scan through new DNA sequences at a speed of 1 billion letters every 10 seconds.
  • The Result: It successfully found "Maximal Exact Matches" (perfectly matching routes) in the data, proving it can handle the complexity of human genetics, including tricky areas like centromeres (the "dense, repetitive downtowns" of the genome).

Why Does This Matter?

Think of this as upgrading from a paper map to a live, GPS-enabled navigation app that updates in real-time.

  • For Doctors: It could help diagnose diseases by comparing a patient's DNA against a map of all human variation, not just one "average" human.
  • For Scientists: It allows them to study evolution and genetics much faster, handling data from thousands of people instead of just a few.
  • For the Future: The ultimate goal is Imputation. Imagine you only have a blurry, low-quality photo of a face (low-coverage DNA data). This system could use the "living city map" to fill in the missing details and reconstruct the full, high-definition face (the full genome) with high accuracy.

In short: This paper presents a new, lightning-fast, and flexible way to organize the world's genetic data, turning a static, heavy library into a dynamic, high-speed highway system for DNA.
