Efficient Grammar Compression via RLZ-based RePair

This is an AI-generated explanation of a preprint that has not been peer-reviewed.

Imagine you have a massive library of books, but they are all copies of the same story with just a few different words here and there. Your goal is to shrink these books down to fit in your pocket without losing a single word of the story.

This is the challenge of Grammar Compression. It's like trying to write a "cheat sheet" or a recipe book that can reconstruct the entire library.

The Old Way: The "RePair" Chef

There is a famous method called RePair. Imagine a chef who looks at a giant stack of text and says, "Hey, the phrase 'the quick' appears a million times! Let's replace every single 'the quick' with a special symbol, like a star (★)." Then, they look for the next most common pair, like "brown fox," and replace that with a moon (🌙).

This works incredibly well to shrink the text. However, there's a catch: the chef needs to see the entire library at once. To find the most common pairs, the chef has to load every single book onto the kitchen counter. If the library is huge (like millions of virus sequences or many copies of a human chromosome), the counter overflows: the chef runs out of space and the job fails.

The New Way: The "RLZ" Librarian

To solve the "too big for the kitchen" problem, another method called RLZ (Relative Lempel-Ziv) was invented. Imagine a librarian who doesn't look at the whole library. Instead, they pick one book as a "Reference Book."

When they need to compress a new book, they say, "Okay, this part of the new book is exactly the same as pages 10–20 of the Reference Book. I'll just write down 'See Reference, pages 10–20'." They do this for the whole book.

This is super efficient! The librarian only needs to keep the Reference Book and a tiny list of instructions. But there's a downside: The librarian is a bit lazy. They only look for matches that line up perfectly with the Reference Book. They miss clever, hidden patterns that cross the boundaries of their instructions. The resulting "cheat sheet" isn't as smart or compact as the Chef's RePair method.

The Solution: "RLZ-RePair" (The Best of Both Worlds)

The authors of this paper, Rahul Varki, Travis Gagie, and Christina Boucher, created a new method called RLZ-RePair.

Think of it as hiring a Master Chef who also knows how to use a Librarian's shortcut.

Here is how it works, step-by-step:

The Setup: Instead of loading the whole library, they pick a "Reference Book" (like the RLZ method). They break the target text into chunks that match the Reference Book.
The Smart Scan: Now, instead of looking at the whole text, the Chef only looks at the Reference Book and the list of instructions.
- If the most common phrase is "the quick," and it appears inside the Reference Book, the Chef replaces it in the Reference Book.
- The Magic Trick: Because the instructions just say "See Reference, pages 10–20," when the Chef changes the Reference Book, all the instructions automatically update! They don't have to touch the millions of copies; they just change the master copy.
Handling the Edges: Sometimes, a common phrase might be split right between two chunks (e.g., the end of one chunk and the start of the next). The algorithm has a special rule to handle this: it "unlocks" those specific edge characters, treats them as regular text, and then continues the smart compression.

Why is this a Big Deal?

Memory Savings: In their tests, this new method used more than 80% less memory than the old RePair method. It's like doing a job that used to need 100 GB of RAM with less than 20 GB.
Perfect Quality: Unlike other "fast" methods that cut corners and produce messy cheat sheets, RLZ-RePair produces the exact same cheat sheet as the original, memory-hungry RePair.
Scalability: They tested this on massive datasets (400,000 SARS-CoV-2 genomes and 1,024 copies of human chromosome 19). On the full chromosome dataset, the old method could not finish within the 24-hour, 100 GB limits. The new method finished the job using less than half of that memory.

The Bottom Line

Imagine you want to compress a massive, repetitive dataset.

Old RePair: Tries to hold the whole ocean in a bucket. It's perfect but the bucket breaks.
Old RLZ: Uses a small cup to scoop water. It fits, but the cup is full of holes (the result is not very compressed).
RLZ-RePair: Uses a magical hose that connects the cup to the ocean. It gets the perfect compression of the ocean but only needs the cup to hold the water.

This new algorithm allows scientists and data engineers to compress huge biological and web datasets efficiently without needing supercomputers with infinite memory, all while keeping the data structure perfectly intact.

1. Problem Statement

Grammar-based compression, specifically the RePair algorithm, is a powerful offline method for discovering hierarchical structures in text by iteratively replacing the most frequent adjacent symbol pairs (bigrams) with non-terminals. While RePair produces highly compact grammars with strong theoretical guarantees (e.g., optimality for Fibonacci strings), it suffers from severe scalability issues:

Memory Usage: Standard RePair requires loading the entire input text into memory, often consuming several times the size of the input. This makes it infeasible for massive datasets (e.g., genomic collections or web-scale data).
Scalability vs. Structure: Existing scalable alternatives (e.g., BigRePair and Re2Pair) use preprocessing steps like rsync-style chunking or recursive prefix-free parsing to reduce memory. However, these methods impose an artificial structure on the grammar based on the initial parsing boundaries rather than actual substring frequencies. Consequently, they fail to detect frequent substrings that span phrase boundaries, resulting in grammars that are structurally distinct from true RePair grammars and less interpretable.

The core challenge is to construct an exact RePair grammar (preserving its combinatorial properties and structural fidelity) while achieving the memory efficiency of online, reference-based parsing methods.
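To make the baseline concrete, here is a minimal sketch of the RePair loop described in this section. It is a naive illustration, not the authors' implementation: the function names `repair` and `expand` and the non-terminal names `N0`, `N1`, ... are assumptions, and the pair counting is recomputed from scratch each round instead of using the priority queue and linked structures of a real RePair. Note that `seq` holds the whole input in memory, which is exactly the scalability problem listed above.

```python
from collections import Counter


def repair(text):
    """Naive RePair sketch: repeatedly replace the most frequent adjacent pair.

    Returns (sequence, rules). `rules` maps each new non-terminal to the
    pair of symbols it stands for.
    """
    seq = list(text)          # the entire input lives in memory
    rules = {}
    next_id = 0
    while True:
        # Counts overlapping occurrences too (e.g. "aaa" counts ("a", "a") twice);
        # a real implementation counts runs more carefully.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        nt = f"N{next_id}"    # hypothetical naming scheme for non-terminals
        next_id += 1
        rules[nt] = pair
        # Rewrite the sequence left to right, replacing non-overlapping occurrences.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a symbol back into the original characters."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

Decompression is just `"".join(expand(s, rules) for s in seq)`, which should reproduce the input exactly.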

2. Methodology: RLZ-RePair

The authors propose RLZ-RePair, a hybrid algorithm that combines Relative Lempel-Ziv (RLZ) parsing with the RePair compression strategy.

Core Concept

Instead of processing the raw input text, RLZ-RePair first parses the input text $T$ relative to a reference string $R$ using RLZ. This decomposes $T$ into a sequence of phrases, where each phrase is a pointer $(p, \ell)$ indicating a substring of length $\ell$ starting at position $p$ in $R$.

Non-Explicit Phrases: These are phrases that reference intervals in $R$. They are stored as logical intervals $(s_i, e_i)$ rather than explicit strings.
Explicit Phrases: These are uncompressed boundary characters introduced when a bigram replacement would invalidate a non-explicit phrase's interval.
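As an illustration of the parse itself, the following sketch computes a greedy RLZ factorization into $(p, \ell)$ pairs. It is a deliberately naive quadratic scan; practical RLZ implementations typically use a suffix array or similar index over $R$. The function names `rlz_parse` and `rlz_decode`, and the convention of emitting a character absent from the reference as a literal `(char, 0)`, are assumptions made for this example only.

```python
def rlz_parse(text, ref):
    """Greedy RLZ sketch: split `text` into phrases (p, length) copied from `ref`.

    A character that never occurs in `ref` is emitted as a literal
    phrase (char, 0); this convention is an assumption of the sketch.
    """
    phrases = []
    i = 0
    while i < len(text):
        best_p, best_len = -1, 0
        # Try every start position in the reference and keep the longest match.
        for p in range(len(ref)):
            length = 0
            while (i + length < len(text) and p + length < len(ref)
                   and text[i + length] == ref[p + length]):
                length += 1
            if length > best_len:
                best_p, best_len = p, length
        if best_len == 0:
            phrases.append((text[i], 0))
            i += 1
        else:
            phrases.append((best_p, best_len))
            i += best_len
    return phrases


def rlz_decode(phrases, ref):
    """Rebuild the text from the phrases and the reference."""
    out = []
    for p, length in phrases:
        out.append(p if length == 0 else ref[p:p + length])
    return "".join(out)


ref = "ACGTACGA"
text = "ACGTTACGACGT"
phrases = rlz_parse(text, ref)
```

Only `ref` and the short list `phrases` are needed to recover `text`, which is the property RLZ-RePair builds on.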

Algorithm Workflow

RLZ Parsing: The input $T$ is parsed against a reference $R$ to generate a sequence of phrases.
Frequency Calculation: Bigram frequencies are computed across the entire structure, including bigrams within phrases, across phrase boundaries, and within the reference $R$ itself.
Iterative Replacement (RePair): The algorithm repeatedly selects the most frequent bigram and replaces it with a new non-terminal.
- Within Phrase Boundaries: If a bigram occurs entirely within the reference $R$ (and thus within all non-explicit phrases referencing it), the replacement is performed only on $R$. Because non-explicit phrases are logical intervals of $R$, the change automatically propagates to all instances in $T$ without modifying $T$ directly. This drastically reduces the number of operations.
- Boundary Constraints: If a bigram spans the boundary of two phrases or partially overlaps a phrase interval in $R$, the algorithm performs "preprocessing" to maintain invariants:
  - Phrase Boundary Condition: If a bigram crosses the boundary between two phrases, the boundary characters are extracted and stored as an Explicit Phrase.
  - Source Boundary Condition: If a bigram overlaps the start or end of a phrase interval in $R$, the overlapping characters are extracted into an Explicit Phrase.
- Reference Management: The reference $R$ is stored as a doubly linked list embedded in an array. This allows for efficient deletion of characters (merging) without shifting the entire array, preserving the logical integrity of the phrase intervals.
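The following sketch illustrates the "replace in $R$, propagate to $T$" idea with a doubly linked list embedded in arrays. It is a simplified assumption-laden illustration, not the paper's data structure: the class name `LinkedRef`, its methods, and the representation of phrases as inclusive slot intervals `(start, end)` are hypothetical. It also assumes that no replaced occurrence straddles the start or end of a phrase interval, which is precisely the situation the boundary conditions above exist to rule out.

```python
class LinkedRef:
    """Reference stored in fixed arrays with next/prev links (a sketch)."""

    def __init__(self, ref):
        n = len(ref)
        self.sym = list(ref)
        self.nxt = [i + 1 if i + 1 < n else -1 for i in range(n)]
        self.prv = [i - 1 for i in range(n)]  # -1 marks the head

    def replace_pair(self, pair, new_symbol):
        """Replace non-overlapping occurrences of `pair`, left to right.

        The left slot receives `new_symbol`; the right slot is unlinked,
        so no other slot has to move. Returns the number of replacements.
        """
        count = 0
        i = 0 if self.sym else -1  # slot 0 is never unlinked (only right slots are)
        while i != -1:
            j = self.nxt[i]
            if j != -1 and (self.sym[i], self.sym[j]) == pair:
                self.sym[i] = new_symbol
                k = self.nxt[j]
                self.nxt[i] = k
                if k != -1:
                    self.prv[k] = i
                count += 1
            i = self.nxt[i]
        return count

    def decode(self, start, end):
        """Symbols stored in the live slots from `start` to `end` inclusive."""
        out = []
        i = start
        while i != -1 and i <= end:
            out.append(self.sym[i])
            i = self.nxt[i]
        return out


ref = LinkedRef("ABCABD")
phrases = [(0, 2), (3, 5), (0, 5)]  # implied text: "ABC" + "ABD" + "ABCABD"
replaced = ref.replace_pair(("A", "B"), "X")
text_after = [sym for start, end in phrases for sym in ref.decode(start, end)]
```

Here the pair occurs four times in the implied text but is rewritten only at its occurrences inside the reference; the phrase intervals are left untouched and simply decode to the updated symbols.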

Key Technical Innovation

The algorithm leverages the fact that in highly repetitive datasets, the reference $R$ is much smaller than $T$. By performing replacements on $R$ rather than $T$, the algorithm reduces memory usage to be proportional to the size of the reference (plus the overhead of the parse) rather than the full input size.

3. Key Contributions

Exact RePair Construction: RLZ-RePair is one of the first scalable methods to construct a grammar that is structurally equivalent to the standard RePair grammar, unlike BigRePair or Re2Pair which produce approximations.
Memory Efficiency: It achieves a massive reduction in memory usage (demonstrated as >80% reduction) by avoiding the loading of the full input text, relying instead on the reference string and logical intervals.
Reduced Operations: By replacing bigrams in the reference $R$ rather than the full text, the number of required bigram substitutions is significantly reduced, as a single replacement in $R$ updates all corresponding occurrences in $T$.
Handling Boundary Cases: The paper provides a rigorous framework for handling bigrams that cross phrase boundaries or overlap reference intervals, ensuring the grammatical integrity of the output.
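To make the boundary cases concrete, the sketch below counts the bigrams of the parsed text using only the reference and the phrase list, and keeps bigrams inside phrases separate from those that cross a phrase boundary. The function name `bigrams_from_parse` and the `(p, length)` phrase format are assumptions for illustration; the count includes overlapping occurrences, which a full RePair implementation would treat more carefully.

```python
from collections import Counter


def bigrams_from_parse(ref, phrases):
    """Count bigrams of the parsed text without materializing it.

    `phrases` is a list of (p, length) pairs with length >= 1, each
    denoting ref[p:p + length]. Returns two Counters: bigrams lying
    inside a phrase, and bigrams crossing a phrase boundary.
    """
    inside, crossing = Counter(), Counter()
    prev_last = None  # last character of the previous phrase
    for p, length in phrases:
        for i in range(p, p + length - 1):
            inside[(ref[i], ref[i + 1])] += 1
        if prev_last is not None:
            crossing[(prev_last, ref[p])] += 1
        prev_last = ref[p + length - 1]
    return inside, crossing


ref = "ACGTACGA"
phrases = [(0, 4), (3, 5), (1, 3)]  # encodes "ACGT" + "TACGA" + "CGT"
inside, crossing = bigrams_from_parse(ref, phrases)
```

In this example the pair `("T", "T")` never occurs in the reference and exists only across a phrase boundary, so a replacement in $R$ alone could not reach it. That is the kind of occurrence the explicit-phrase mechanism is there to handle.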

4. Experimental Results

The authors evaluated RLZ-RePair on two large biological datasets: 400,000 SARS-CoV-2 genomes and 1,024 human chromosome 19 assemblies.

SARS-CoV-2 Dataset (11.93 GB)

Memory: RLZ-RePair (0.5% reference) used 17.17 GB of RAM, whereas the standard RePair (large_bal) required 99.88 GB. This represents an 82.8% reduction in memory.
Runtime: RLZ-RePair was only 27.5% slower than RePair (4,942s vs. 3,875s).
Compression Quality: RLZ-RePair produced a compressed file of 20.48 MB, identical to RePair. In contrast, BigRePair and Re2Pair produced larger files (24.64 MB and 34.88 MB, respectively) with significantly more rules, confirming they do not achieve the same compression efficiency.
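The percentages above follow directly from the reported raw figures. The short check below recomputes them; the helper names are illustrative, and the relative file-size overheads for BigRePair and Re2Pair are derived here from the reported sizes rather than quoted from the paper.

```python
def percent_reduction(baseline, value):
    """How much smaller `value` is than `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


def percent_increase(baseline, value):
    """How much larger `value` is than `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Memory in GB: RePair 99.88, RLZ-RePair 17.17
memory_saving = percent_reduction(99.88, 17.17)

# Runtime in seconds: RePair 3,875, RLZ-RePair 4,942
slowdown = percent_increase(3875, 4942)

# Output size in MB relative to the 20.48 MB RePair grammar (derived values)
bigrepair_overhead = percent_increase(20.48, 24.64)
re2pair_overhead = percent_increase(20.48, 34.88)

for name, pct in [("memory saving", memory_saving),
                  ("slowdown", slowdown),
                  ("BigRePair size overhead", bigrepair_overhead),
                  ("Re2Pair size overhead", re2pair_overhead)]:
    print(f"{name}: {pct:.1f}%")
```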

Chromosome 19 Dataset (60.54 GB)

Scalability: Standard RePair failed to compress the full 1,024-sequence dataset within the limits of 24 hours and 100 GB of RAM (thrashing occurred). RLZ-RePair successfully compressed the full dataset using 31–41 GB of RAM.
Comparison: On the largest subset RePair could handle (256 sequences), RLZ-RePair used 83.1% less memory while being 34.5% slower.
Grammar Quality: For the 256-sequence subset, RLZ-RePair achieved the same compression ratio (0.45%) and rule count as RePair, significantly outperforming BigRePair and Re2Pair in terms of grammar compactness.

5. Significance

Theoretical Fidelity: The work bridges the gap between theoretical grammar compression (RePair) and practical scalability. It shows that one does not need to sacrifice the "exactness" of RePair to handle massive datasets.
Practical Application: The method is particularly valuable for domains with high repetitiveness, such as genomics (where sequences are highly similar) and version control systems.
Future Directions: The authors note that performance is sensitive to the choice of the reference string. Future work could focus on systematic reference selection (e.g., using clustering or specific heuristics) to further optimize phrase lengths and reduce the number of explicit boundary characters.

In conclusion, RLZ-RePair offers a practical, scalable solution for grammar-based compression that retains the exact grammar structure and theoretical properties of RePair while overcoming its primary limitation: memory consumption.