Sassy2: Batch Searching of Short DNA Patterns

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a librarian in a massive, chaotic library (the human genome or the output of a DNA sequencing machine). Your job is to find specific, short book titles (like barcodes, primers, or CRISPR guides) hidden inside millions of pages of text.

The problem is that the pages are messy. Some letters are smudged, some are missing, and some are extra. You aren't looking for a perfect match; you are looking for "close enough" matches (allowing for a few typos).

This is the job of Sassy2. Here is how it works, explained through simple analogies.

The Old Way: The "One-by-One" Search

Before Sassy2, there was a tool called Sassy1. Imagine Sassy1 is a very fast librarian who can read a whole page in one second. However, if you give them 100 different book titles to find, they have to:

Scan the whole library for Title A.
Scan the whole library for Title B.
Scan the whole library for Title C... and so on.

Even though they are fast, doing this 100 times takes a long time. Also, if the library is small (a short DNA read), Sassy1 loses its edge: it gets its speed by splitting one long text among its workers, and a short text leaves most of them with nothing to do.

The New Way: The "Super-Team" (Sassy2)

Sassy2 changes the game by using SIMD (Single Instruction, Multiple Data). Think of this not as one librarian, but as a super-team of 32 librarians standing in a row, all reading the same page at the exact same time.

Instead of searching for one title at a time, Sassy2 gives each librarian a different title to look for simultaneously.

Librarian 1 looks for Title A.
Librarian 2 looks for Title B.
Librarian 3 looks for Title C...

They all scan the text together. This is called pattern tiling. It's like having 32 pairs of eyes scanning the same wall for 32 different "Wanted" posters at once.

The Secret Sauce: The "Suffix Filter"

Here is the tricky part: checking whether a word is a close enough match takes time. If you have 32 librarians checking 32 different words against a page, and they all have to check every single letter, it's still slow.

Sassy2 uses a clever trick called Suffix Filtering.

Imagine you are looking for the word "Elephant" in a crowd. You don't need to check the whole word to know it's not an elephant. You just need to check the end of the word.

If the word ends in "apple," you know immediately it's not "Elephant."
You only need to do the full, detailed check if the word ends in "phant."

Sassy2 does this with DNA:

The Quick Scan: It first looks at just the last few letters (the "suffix") of all 32 patterns. It does this super fast because it's checking a tiny piece of the puzzle.
The Filter: If a piece of text doesn't match the "suffix" closely enough (e.g., the last 16 letters contain too many typos), Sassy2 instantly says, "Nope, not a match!" and moves on.
The Deep Dive: Only if the suffix does look promising does the team stop and do the full, detailed check on the whole word.

This saves a massive amount of time because most random text in the library won't match the suffix. Sassy2 filters out the noise before doing the heavy lifting.

Why is this a big deal?

The paper shows that Sassy2 is incredibly fast:

Speed: It is 20 to 45 times faster than the current standard tools (like Edlib) and 2 to 5 times faster than its predecessor (Sassy1).
Short Reads: It works amazingly well on short pieces of text (like short DNA reads from a machine), where older tools were slow and clumsy.
Real World:
- CRISPR: It can scan the entire human genome for 312 different "cutting guides" in seconds, helping scientists edit genes safely.
- Nanopore: It can sort through millions of DNA samples to find specific "barcodes" (like sorting mail) in the blink of an eye.

The Catch

There is one small rule: Sassy2 works best when all the "book titles" you are looking for are the same length. If you have a mix of 20-letter words and 50-letter words, you have to sort them into separate piles first. But for most modern DNA tasks, this is a small price to pay for the incredible speed.

Summary

Sassy2 is like upgrading from a single detective searching a city block to a swarm of 32 drones scanning the whole city at once, using a quick-look filter to ignore most of the buildings before they even land. It makes finding tiny, slightly messy DNA patterns in massive amounts of data almost instantaneous.

1. Problem Statement

The paper addresses the challenge of Multiple Approximate String Matching (MASM) in bioinformatics. Specifically, it focuses on searching for batches of short DNA patterns (e.g., barcodes, primers, CRISPR spacers, typically 20–40 bp) within large genomic texts or sequencing reads, allowing for a small number of errors ( $k$ ) such as mismatches, insertions, and deletions.

Key Challenges:

Inefficiency of Classical Methods: Traditional seeding-based approaches fail for short patterns as the error threshold $k$ increases, generating too many false positives or missing true matches.
Limitations of Previous Work (Sassy1): The authors' previous tool, Sassy1, utilized "text-tiling" (splitting a long text across SIMD lanes) to accelerate single-pattern searches. However, this approach is inefficient for short texts (where the text length $n$ is comparable to the SIMD register width) and for batch processing multiple patterns, as it requires rescanning the text for each pattern and leaves SIMD lanes underutilized.
Hardware Constraints: While modern CPUs support wide SIMD registers (AVX2: 256-bit, AVX-512: 512-bit), packing many short patterns into a single machine word (intra-word parallelism) is often limited by the pattern length relative to the word size.
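The lane arithmetic behind these register widths can be made concrete with a small sketch. The register widths are the ones named above; the set of allowed lane widths and the helper names `lane_count` and `min_lane_bits` are illustrative assumptions, not part of Sassy2.

```python
# Register widths (in bits) named in the text.
REGISTER_BITS = {"AVX2": 256, "AVX-512": 512}


def lane_count(register_bits: int, lane_bits: int) -> int:
    """Number of equal-width lanes that fit in one SIMD register."""
    if register_bits % lane_bits != 0:
        raise ValueError("lane width must divide the register width")
    return register_bits // lane_bits


def min_lane_bits(pattern_len: int, allowed=(8, 16, 32, 64)) -> int:
    """Smallest allowed lane width with one bit per pattern character.

    The 'allowed' widths are an assumption made for this illustration.
    """
    for bits in allowed:
        if pattern_len <= bits:
            return bits
    raise ValueError("pattern is longer than the widest lane")


if __name__ == "__main__":
    for name, width in REGISTER_BITS.items():
        for lane_bits in (16, 32):
            print(name, lane_bits, "bit lanes:", lane_count(width, lane_bits))
```

Under these assumptions, a 23 bp pattern needs a 32-bit lane, so a 512-bit register holds 16 such patterns, while a 16 bp piece fits a 16-bit lane and doubles the lane count.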

2. Methodology

Sassy2 introduces a novel SIMD-optimized pattern-tiling approach combined with a suffix filter to maximize parallelism when searching batches of short, equal-length patterns.

Core Architecture

SIMD Pattern Tiling: Instead of splitting the text, Sassy2 distributes a batch of $L$ patterns across $L$ independent SIMD lanes. Each lane maintains its own Myers bit-vector state. This allows the algorithm to scan the text once while comparing every text character against all $L$ patterns simultaneously.
Myers' Bit-Parallelism: The implementation builds on Myers' bit-vector algorithm, encoding Dynamic Programming (DP) matrix columns into machine words to compute edit distance in $O(n \lceil m/w \rceil)$ time.
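As a rough illustration of the two ideas above, the following sketch runs Myers' bit-vector recurrence (in the commonly used Hyyrö formulation) for several equal-length patterns over a single pass of the text. It is not the Sassy2 implementation: Python integers stand in for machine words, and a plain list of per-pattern states stands in for SIMD lanes, which real SIMD code would update with one instruction per operation. The names `build_peq`, `myers_step`, and `search_batch` are hypothetical.

```python
def build_peq(pattern):
    """Map each character to a bitmask of the pattern positions where it occurs."""
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    return peq


def myers_step(state, eq, m):
    """Advance one DP column for one lane.

    state = (pv, mv, score): positive/negative vertical deltas of the column
    and the cost in its last row. eq is the match mask of the text character.
    """
    pv, mv, score = state
    mask = (1 << m) - 1          # emulate an m-bit word
    top = 1 << (m - 1)           # bit of the last pattern row
    xv = eq | mv
    xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
    ph = (mv | ~(xh | pv)) & mask
    mh = pv & xh
    if ph & top:
        score += 1
    elif mh & top:
        score -= 1
    ph = (ph << 1) & mask        # no carry-in bit: a match may start anywhere in the text
    mh = (mh << 1) & mask
    pv = (mh | ~(xv | ph)) & mask
    mv = ph & xv
    return pv, mv, score


def search_batch(patterns, text, k):
    """Scan the text once; report (lane, end_position, cost) with cost <= k."""
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns), "patterns must have equal length"
    peqs = [build_peq(p) for p in patterns]
    lanes = [((1 << m) - 1, 0, m) for _ in patterns]   # one Myers state per lane
    hits = []
    for j, ch in enumerate(text):
        for lane, peq in enumerate(peqs):
            lanes[lane] = myers_step(lanes[lane], peq.get(ch, 0), m)
            if lanes[lane][2] <= k:
                hits.append((lane, j, lanes[lane][2]))
    return hits


if __name__ == "__main__":
    print(search_batch(["ACGT", "TTTT"], "GGACGTGG", 1))
```

End positions are 0-based indices of the last text character of a match. The inner loop over lanes is the part that pattern tiling replaces with lane-parallel SIMD operations.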

The Suffix Filter (Two-Stage Approach)

To overcome the inability of standard pattern-tiling to perform "early rejection" (stopping calculation before the full pattern is processed), Sassy2 employs a two-stage filtering strategy:

Stage 1: Suffix Filtering (Fast Rejection):
- The algorithm first searches only the suffixes of the patterns (e.g., the last 16 bp of a 32 bp pattern).
- It uses a reduced lane width ( $w' < w$ ), such as 8, 16, or 32 bits, which increases the number of available SIMD lanes ( $L' = W/w'$ ).
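- This stage processes the text against these shorter suffixes. If the cheapest suffix alignment ending at a position costs more than the error threshold $k$ , the position is rejected immediately. This is safe because any full-pattern alignment of cost $\le k$ contains a suffix alignment of cost $\le k$ ending at the same position.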
- Note: The suffix length is empirically tuned based on $k$ (e.g., $w'=16$ for $0 < k < 4$ ).
Stage 2: Full Pattern Verification:
- Only positions where the suffix matches with cost $\le k$ proceed to full verification.
- The algorithm computes the full Myers DP matrix for the entire pattern against the relevant text slice.
- Batch Tracing: If multiple suffix matches cluster together, they are grouped into contiguous ranges. The DP matrix is computed for the entire range in parallel, amortizing the cost of matrix construction and reducing redundant calculations.
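The two stages above can be sketched for a single pattern as follows. This is a simplified model under stated assumptions, not the Sassy2 code: it uses a plain column-by-column dynamic program instead of SIMD bit-vectors, it merges only directly adjacent candidate positions into ranges, and the window size `m + k` and the names `end_costs`, `group_ranges`, and `suffix_filter_search` are choices made for this illustration.

```python
def end_costs(pattern, text):
    """costs[j] = minimum edit distance between the pattern and any
    substring of the text that ends at index j (free start in the text)."""
    m = len(pattern)
    col = list(range(m + 1))
    out = []
    for ch in text:
        new = [0]  # row 0 is free: a match may start anywhere
        for i in range(1, m + 1):
            new.append(min(col[i - 1] + (pattern[i - 1] != ch),  # match or substitution
                           col[i] + 1,                           # extra text character
                           new[i - 1] + 1))                      # missing pattern character
        col = new
        out.append(col[m])
    return out


def group_ranges(positions):
    """Merge consecutive positions into inclusive (lo, hi) ranges."""
    ranges = []
    for j in positions:
        if ranges and j == ranges[-1][1] + 1:
            ranges[-1][1] = j
        else:
            ranges.append([j, j])
    return [(lo, hi) for lo, hi in ranges]


def suffix_filter_search(pattern, text, k, suffix_len):
    """Stage 1: filter on the pattern suffix. Stage 2: verify the full pattern."""
    m = len(pattern)
    suffix = pattern[-suffix_len:]
    # Stage 1: keep only end positions where the suffix matches with cost <= k.
    candidates = [j for j, c in enumerate(end_costs(suffix, text)) if c <= k]
    hits = []
    # Stage 2: one verification per contiguous range of candidates.
    for lo, hi in group_ranges(candidates):
        # An alignment of cost <= k covers at most m + k text characters.
        start = max(0, lo + 1 - (m + k))
        costs = end_costs(pattern, text[start:hi + 1])
        for j in range(lo, hi + 1):
            if costs[j - start] <= k:
                hits.append((j, costs[j - start]))
    return hits


if __name__ == "__main__":
    print(suffix_filter_search("ACGTTGCA", "TTTTACGTTGCATTTT", 1, 4))
```

Because a suffix of a matching alignment never costs more than the whole alignment, stage 1 cannot discard a true match; it only reduces how often the more expensive stage 2 runs.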

Match Reporting

Sassy2 supports two reporting modes:

Reporting all end positions where an alignment of cost $\le k$ exists.
Reporting only the rightmost local minima (minimum cost alignments).
It also supports "overhang costs" for semi-global alignment, allowing reduced penalties for unaligned regions at the start or end of the text.
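The difference between the two reporting modes can be shown on a list of per-position costs. The sketch below assumes one plausible reading of "rightmost local minimum": a position whose cost is at most $k$ , no higher than its left neighbour, and strictly lower than its right neighbour, so that only the last position of a flat minimum is reported. That definition and the function names are assumptions for illustration; overhang costs are not modelled.

```python
def all_end_positions(costs, k):
    """Mode 1: every end position whose best alignment costs at most k."""
    return [(j, c) for j, c in enumerate(costs) if c <= k]


def rightmost_local_minima(costs, k):
    """Mode 2 (assumed definition): positions with cost <= k that are not
    beaten on the left and are strictly better than the right neighbour."""
    hits = []
    n = len(costs)
    for j, c in enumerate(costs):
        if c > k:
            continue
        left_ok = j == 0 or costs[j - 1] >= c
        right_ok = j == n - 1 or costs[j + 1] > c
        if left_ok and right_ok:
            hits.append((j, c))
    return hits


if __name__ == "__main__":
    costs = [5, 3, 2, 2, 3, 1, 2, 4]   # hypothetical cost per end position
    print(all_end_positions(costs, 3))
    print(rightmost_local_minima(costs, 3))
```

On this hypothetical input with $k = 3$ , the first mode reports six end positions, while the second keeps only positions 3 and 5, one per dip in the cost curve.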

3. Key Contributions

Multi-Pattern SIMD Implementation: A practical Rust implementation that packs multiple short, equal-length patterns into SIMD lanes, optimizing pattern encoding for efficient SIMD loading.
Suffix Filter Mechanism: A computationally cheap, two-stage approach that recovers early rejection capabilities in pattern-tiling. By filtering on short suffixes with higher parallelism (more lanes), it drastically reduces the number of expensive full-pattern verifications required.
Batch Optimization: Unlike Sassy1, which is optimized for long texts and single patterns, Sassy2 is specifically designed for high-throughput batch searching of short patterns, making it ideal for modern genomics workflows involving thousands of guides or barcodes.

4. Results

The authors evaluated Sassy2 against Sassy1 and Edlib (a standard non-SIMD library) on synthetic and real-world datasets using an Intel Xeon Gold 6530 CPU (supporting AVX-512).

Synthetic Data Performance

Short Texts ( $n \le 200$ bp): Sassy2 achieves 10–50× speedup over Sassy1 and up to 467× speedup over Edlib. Sassy2 maintains high throughput even for very short reads where Sassy1 is inefficient.
Large Texts ( $n \ge 1$ Mbp): Sassy2 achieves 2–4× speedup over Sassy1 and significantly outperforms Edlib.
Scalability: Throughput scales near-linearly with the number of patterns ( $r$ ) until SIMD lanes are saturated (e.g., 32 patterns for AVX-512).

Real-World Applications

CRISPR Off-Target Searching: Searching 312 gRNAs (23 bp) against the human genome (3.12 Gbp) with $k=3$ .
- Sassy2: 105.9 Gbp/s per pattern (30 ms per guide).
- Sassy1: 28.6 Gbp/s (3.7× slower).
- Edlib: 3.0 Gbp/s (35.7× slower).
Nanopore Barcode Demultiplexing: Searching 96 barcodes (24 bp) in 334 Mbp of Nanopore reads with $k=3$ .
- Sassy2: 116.8 Gbp/s per pattern (0.27 s total).
- Sassy1: 25.5 Gbp/s (4.6× slower).
- Edlib: 2.57 Gbp/s (45× slower).
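The reported times and slowdown factors can be cross-checked from the throughput figures. The sketch below assumes that "per pattern" throughput means text length times number of patterns divided by wall time; that reading is an assumption, although it reproduces the quoted timings. All input numbers are taken from the list above.

```python
def search_seconds(text_bp, n_patterns, gbp_per_s):
    """Wall time implied by a per-pattern throughput (assumed definition)."""
    return text_bp * n_patterns / (gbp_per_s * 1e9)


# CRISPR: 3.12 Gbp genome, 312 guides, 105.9 Gbp/s per pattern.
crispr_per_guide = search_seconds(3.12e9, 1, 105.9)
crispr_all_guides = search_seconds(3.12e9, 312, 105.9)

# Nanopore: 334 Mbp of reads, 96 barcodes, 116.8 Gbp/s per pattern.
nanopore_total = search_seconds(334e6, 96, 116.8)

# Slowdown factors as ratios of the rounded throughputs.
ratios = {
    "CRISPR, Sassy1": 105.9 / 28.6,
    "CRISPR, Edlib": 105.9 / 3.0,
    "Nanopore, Sassy1": 116.8 / 25.5,
    "Nanopore, Edlib": 116.8 / 2.57,
}

if __name__ == "__main__":
    print(f"CRISPR, one guide:  {crispr_per_guide * 1e3:.1f} ms")
    print(f"CRISPR, all guides: {crispr_all_guides:.2f} s")
    print(f"Nanopore, total:    {nanopore_total:.2f} s")
    for name, r in ratios.items():
        print(f"{name}: {r:.1f}x slower")
```

This gives roughly 29 ms per guide (quoted as 30 ms), about 9 s for all 312 guides, and 0.27 s for the barcode search. The Edlib ratio on the CRISPR task comes out near 35.3 rather than the quoted 35.7, which is within the rounding of the 3.0 Gbp/s figure.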

5. Significance and Limitations

Significance:
Sassy2 substantially speeds up approximate search for short, batched patterns. By using modern SIMD hardware more fully than previous methods, it enables rapid, high-sensitivity searches across whole genomes and long-read datasets. It also shows that "early ideas" in algorithmic optimization (like bit-packing) can be adapted to modern SIMD hardware to solve contemporary biological problems.

Limitations:

Equal Length Requirement: Currently, Sassy2 requires all patterns in a batch to be of equal length. Variable-length patterns must be processed in separate batches.
Future Work: The authors suggest combining the Sassy2 suffix filter with text-tiling (Sassy1's approach) as a promising direction for handling variable-length patterns or optimizing for different text/pattern size ratios.
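In practice, the equal-length requirement means a mixed set of patterns has to be partitioned before searching. A minimal sketch of that preprocessing step is shown below; the function name `batches_by_length` and the `lanes` parameter (how many patterns one batch may hold) are hypothetical and not part of the Sassy2 interface.

```python
from collections import defaultdict


def batches_by_length(patterns, lanes):
    """Group patterns by length, then split each group into batches
    of at most 'lanes' patterns (one pattern per SIMD lane)."""
    by_len = defaultdict(list)
    for p in patterns:
        by_len[len(p)].append(p)
    batches = []
    for length in sorted(by_len):
        group = by_len[length]
        for i in range(0, len(group), lanes):
            batches.append(group[i:i + lanes])
    return batches


if __name__ == "__main__":
    mixed = ["AAAA", "CC", "GGGG", "TT", "ACGT"]
    for batch in batches_by_length(mixed, lanes=2):
        print(len(batch[0]), batch)
```

Each resulting batch contains patterns of a single length and can be searched independently; a batch with fewer patterns than lanes simply leaves some lanes unused.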

Availability:
Sassy2 is implemented in Rust and is open-source at github.com/RagnarGrootKoerkamp/sassy.