Compressed inverted indexes for scalable sequence similarity

This paper introduces Onika, a Rust-based system that utilizes compressed inverted indexes and early-pruning schemes to enable scalable, output-sensitive all-vs-all sequence similarity comparisons with orders-of-magnitude speedups over existing forward-index tools while maintaining rigorous accuracy guarantees.

Original authors: Ingels, F., Vandamme, L., Girard, M., Agret, C., Cazaux, B., Limasset, A.

Published 2026-02-17

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a librarian in a library that has grown so massive it now contains billions of books (representing DNA sequences from bacteria, viruses, and humans). Your job is to find books that are similar to each other—maybe to find different strains of the same bacteria or to spot errors in genetic data.

In the past, to find similar books, you had to take every single book off the shelf, open it, and compare its first few sentences with every other book in the library. This is like the old way computers did it: slow, exhausting, and impossible when the library gets too big.

This paper introduces a new, super-fast system called Onika that solves this problem using four clever tricks. Here is how it works, explained simply:

1. The "Fingerprint" Shortcut (Sketching)

Instead of reading the whole book, imagine you take a tiny fingerprint of each one.

  • Old Way: You compare the whole book (thousands of pages).
  • New Way: You only compare a small, unique set of 512 numbers (the fingerprint) that represents the book.
  • The Problem: Even with fingerprints, if you have a billion books, you still have to compare every fingerprint against every other fingerprint. That's roughly half a quintillion (5 × 10^17) pairwise comparisons! It's like trying to match every person in a stadium with every other person by shaking hands.
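To make the fingerprint idea concrete, here is a minimal sketch in Python. It assumes a bottom-k MinHash over k-mers (short overlapping substrings), which is one common sketching scheme; the preprint's actual scheme may differ. The function names, the `blake2b` hash, and the toy sizes (`k=4`, `size=8` instead of 512) are all illustrative choices, not taken from Onika.

```python
import hashlib

def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (blake2b is a stand-in choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k=4, size=8):
    """Bottom-k sketch: the `size` smallest distinct k-mer hashes, sorted."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]

def jaccard_estimate(a, b, size=8):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Assumes at least one sketch is non-empty.
    """
    merged = sorted(set(a) | set(b))[:size]
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)

def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

book_a = sketch("ACGTACGTTGCAAGCTTAGC")
book_b = sketch("ACGTACGTTGCAAGCTTAGG")  # differs in the last letter only
similarity = jaccard_estimate(book_a, book_b)
all_pairs = pair_count(10**9)  # pairs to compare for a billion books
```

Each sequence is reduced to a fixed number of integers, so a single comparison is cheap. The value of `pair_count` shows why brute-force all-vs-all comparison still does not scale: the number of pairs grows quadratically with the number of books.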

2. The "Phone Book" Trick (Inverted Index)

This is the paper's biggest breakthrough. Instead of listing books and then their fingerprints (Forward Index), Onika builds a giant phone book (Inverted Index).

  • The Old Way (Forward Index):

    • Book A: Fingerprint 123, 456, 789
    • Book B: Fingerprint 123, 999, 101
    • To find matches: You have to scan Book A, then Book B, then Book C... comparing them one by one.
  • The Onika Way (Inverted Index):

    • Fingerprint 123: Appears in Book A, Book B, Book Z...
    • Fingerprint 456: Appears in Book A, Book K...
    • To find matches: You look up "123" in the phone book. Boom! The book instantly tells you, "Hey, Book A and Book B both have this fingerprint." Every shared fingerprint counts as one vote that two books are similar, and you never need to check the billions of other books that don't have that fingerprint.
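The two layouts above can be sketched in a few lines of Python. This is an illustrative toy, not Onika's implementation: the function names are hypothetical, Book K's extra fingerprints and Book Q are invented for the example, and real posting lists would be compressed rather than stored as plain Python lists.

```python
from collections import defaultdict
from itertools import combinations

def build_inverted_index(sketches):
    """Map each fingerprint value to the sorted list of books containing it."""
    index = defaultdict(list)
    for book in sorted(sketches):
        for fp in set(sketches[book]):
            index[fp].append(book)
    return dict(index)

def shared_counts(index):
    """Count shared fingerprints, visiting only pairs that co-occur in a list."""
    counts = defaultdict(int)
    for books in index.values():
        for a, b in combinations(books, 2):
            counts[(a, b)] += 1
    return dict(counts)

# Forward index: book -> fingerprints.
sketches = {
    "A": [123, 456, 789],
    "B": [123, 999, 101],
    "K": [456, 555, 777],  # hypothetical values beyond 456
    "Q": [314, 159, 265],  # hypothetical book sharing nothing
}

index = build_inverted_index(sketches)
counts = shared_counts(index)
# counts holds only ("A", "B") and ("A", "K"); Book Q is never visited.
```

The work done by `shared_counts` depends on how many pairs actually share a fingerprint, not on the total number of possible pairs. That is the sense in which the approach is output-sensitive.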

The Analogy:
Imagine you are looking for people who share a specific birthday.

  • Old Way: You walk up to every person in the stadium and ask, "What is your birthday?" Then you compare lists.
  • Onika Way: You have a list sorted by birthday. You just look at the "January 1st" list. Everyone on that list is a match. You ignore everyone born on other days instantly.

The authors proved mathematically that this "phone book" takes up exactly the same amount of memory as the old way, but it is much faster because it never touches pairs of books that share no fingerprint.

3. The "Early Exit" Strategy (Pruning)

Sometimes, you don't need to check every fingerprint to know two books are different.

  • The Scenario: You are comparing Book A and Book B. You check the first 10 fingerprints, and they only match once. You know they are very different.
  • The Old Way: The computer keeps checking all 512 fingerprints just to be sure.
  • The Onika Way: It has a smart rule. "If they haven't matched enough by now, stop checking! They are definitely not similar enough to care about."
  • The Result: It stops the comparison early, saving massive amounts of time and energy. It's like a bouncer at a club who sees you don't have a ticket and stops you at the door instead of letting you in to check your ID later.
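One simple way to express an early-exit rule is sketched below. It assumes that fingerprints are compared position by position and that a pair only matters if it reaches a minimum number of matches (`min_matches`, a hypothetical parameter). The comparison stops as soon as the pair could not reach that number even if every remaining position matched. This is a conservative bound chosen for illustration; the preprint's actual pruning scheme may be different.

```python
def count_matches_with_early_exit(a, b, min_matches):
    """Compare two equal-length sketches position by position.

    Returns (matches, positions_checked, reached_threshold).
    """
    size = len(a)
    matches = 0
    for i in range(size):
        if a[i] == b[i]:
            matches += 1
        remaining = size - (i + 1)
        # Even if all remaining positions matched, the threshold is out of reach.
        if matches + remaining < min_matches:
            return matches, i + 1, False
    return matches, size, matches >= min_matches

book_a = list(range(512))
book_b = [0] + [-1] * 511  # agrees with book_a at the first position only

result = count_matches_with_early_exit(book_a, book_b, min_matches=503)
# result == (1, 11, False): the loop stops after 11 of the 512 positions.
```

Because the rule only gives up when the threshold is mathematically unreachable, it never discards a pair that would have passed. Speed is gained without losing any qualifying match.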

4. Organizing the Shelves (Reordering)

Finally, Onika is smart about how it stores the data.

  • If you have 1,000 copies of the same book (redundant data), Onika realizes they are similar.
  • It rearranges the library so that similar books are sitting right next to each other on the shelf.
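  • This makes the "phone book" lists much easier to compress (like packing a suitcase tightly), because similar books end up with neighboring shelf numbers and each list can be stored as a run of small gaps. This saves even more space and makes the computer run faster.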
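The effect of reordering on compression can be illustrated with gap (delta) encoding and a variable-length integer code. This is a generic technique for compressing sorted ID lists; whether Onika uses this exact codec is an assumption. The book IDs and the `new_ids` renumbering below are made up for the example.

```python
def varint_len(n):
    """Bytes needed to store n as a 7-bits-per-byte variable-length integer."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length

def compressed_size(posting_list):
    """Bytes for a posting list stored as its first ID followed by gaps."""
    size, previous = 0, 0
    for book_id in sorted(posting_list):
        size += varint_len(book_id - previous)
        previous = book_id
    return size

def renumber(posting_list, new_ids):
    """Apply a reordering of the library to one posting list."""
    return sorted(new_ids[b] for b in posting_list)

# Four similar books that were originally scattered across the library.
scattered = [7, 40_000, 900_000, 2_500_000]

# Hypothetical reordering that shelves them side by side.
new_ids = {7: 0, 40_000: 1, 900_000: 2, 2_500_000: 3}
clustered = renumber(scattered, new_ids)

before = compressed_size(scattered)  # 10 bytes: large gaps need 3 bytes each
after = compressed_size(clustered)   # 4 bytes: every gap fits in 1 byte
```

The list holds the same four books before and after. Only the gaps between consecutive IDs shrink, and small gaps take fewer bytes to store.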

The Bottom Line

The authors built a tool called Onika (written in a programming language called Rust) that uses these tricks.

  • Speed: It is orders of magnitude faster than existing forward-index tools when comparing huge collections of DNA against each other.
  • Memory: It doesn't need more computer memory than the old tools; it just uses it smarter.
  • Accuracy: It doesn't miss any important matches; it just ignores the ones that are obviously unimportant.

In short: They turned a slow, brute-force search into a smart, organized, and lightning-fast system, allowing scientists to analyze the entire "library of life" without waiting years for the results.
