Huffman-Bucket Sketch: A Simple $O(m)$ Algorithm for Cardinality Estimation

Imagine you are a librarian trying to count how many unique books have been checked out of a massive library over the last year. The problem? The library is so huge that the number of books could be in the billions, and you don't have enough shelf space to write down every single title.

If you tried to keep a list of every unique book, you'd run out of space instantly. So, instead, you use a clever trick called a Sketch. Think of a sketch not as a drawing, but as a "fingerprint" of the data. It doesn't remember which books were checked out, but it remembers enough about the pattern of checkouts to give you a very good guess at the total number.

The most famous of these tricks is called HyperLogLog (HLL). It's like a super-efficient, tiny notebook that fits in your pocket. It's so good that it's used by giants like Google and Facebook. But even this tiny notebook has a flaw: it's still a bit bulky. It takes up more space than strictly necessary because it writes every number in the same fixed-width format, reserving the same number of digits for every entry even though a few values show up far more often than the rest and could be given much shorter codes.

Enter the Huffman-Bucket Sketch (HBS). This paper introduces a new way to pack that same notebook into an even smaller space without losing any information. Here is how it works, using some everyday analogies:

1. The "Zipper" Effect (Compression)

Imagine you have a bag of marbles. Most of them are red, a few are blue, and only one is gold.

The Old Way (HLL): You write down the color of every marble: "Red, Red, Red, Red, Blue, Red, Red, Gold..." This takes up a lot of paper.
The New Way (HBS): You realize that "Red" happens 90% of the time. So, you invent a secret code where "Red" is just a tiny dot (•), "Blue" is a short dash (–), and "Gold" is a long string (———). Now, your list looks like: "•, •, •, •, –, •, •, ———". It takes up way less space!

In the paper, the "marbles" are the numbers stored in the sketch (called registers). The authors noticed that in these sketches, most numbers are clustered around a specific value (like the "Red" marbles), with very few extreme outliers. They use a Huffman code (the secret code) to give the common numbers short codewords of just a few bits and the rare numbers longer ones.
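To make the marble analogy concrete, here is a minimal sketch of textbook Huffman coding applied to the three colours. The probabilities (90% red, 9% blue, 1% gold) and the function name `huffman_code_lengths` are illustrative assumptions, not values or code from the paper.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length in bits} for a Huffman code."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(w, next(tie), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least likely subtrees; every symbol inside
        # them moves one level deeper, i.e. gains one bit.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

# Assumed marble frequencies, chosen only to mirror the analogy.
marbles = {"red": 0.90, "blue": 0.09, "gold": 0.01}
lengths = huffman_code_lengths(marbles)
avg_bits = sum(marbles[s] * lengths[s] for s in marbles)
fixed_bits = 2  # a fixed-width code needs 2 bits to name 3 colours
print(lengths, avg_bits, fixed_bits)
```

With these assumed frequencies, "red" gets a 1-bit codeword and the two rare colours get 2 bits each, so the average cost is about 1.1 bits per marble instead of 2.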

2. The "Bucket" Strategy (Organization)

You can't just compress a whole library at once; it's too messy. So, the authors divide the notebook into buckets (small groups of registers).

Think of a bucket as a single cache line in your computer's memory (a tiny chunk of data your processor can grab instantly).
By keeping the buckets small, the computer can process them incredibly fast, almost like snapping your fingers.

3. The "Baron Münchhausen" Trick (Self-Correction)

Here is the cleverest part. To create the perfect secret code (Huffman code), you need to know how often each number shows up in the notebook, and that depends on how many unique books (data points) you have. But the whole point of the sketch is that you don't know that number yet! You are trying to find it.

So, how do you make the code?

The authors use a trick inspired by the fictional Baron Münchhausen, who pulled himself out of a swamp by his own hair.
The sketch makes a rough guess at the total number of unique items.
It uses that rough guess to build a "good enough" secret code.
As more data comes in, the sketch updates its guess. When the guess changes significantly (like when the number of items doubles), the sketch pauses for a split second, rebuilds the secret code to match the new reality, and keeps going.
This happens so rarely (only when the count roughly doubles) that, averaged over all the updates in between, it adds almost nothing to the cost of each one. It's like updating your library catalog only once a year, even though books are coming in every day.
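The doubling rule can be sketched in a few lines. This is a simplified illustration, not the paper's procedure: it assumes the estimate is exact, uses a growth factor of exactly 2, and the name `count_rebuilds` is hypothetical.

```python
def count_rebuilds(estimates, factor=2.0):
    """Count how often a codebook rebuild would fire.

    `estimates` is the sequence of cardinality estimates seen over time.
    A rebuild fires when the estimate has grown by `factor` since the
    estimate that the current codebook was built from.
    """
    rebuilds = 0
    n_old = None
    for n_hat in estimates:
        if n_old is None:
            n_old = n_hat  # first codebook, built from the first estimate
            continue
        if n_hat >= factor * n_old:
            rebuilds += 1
            n_old = n_hat  # remember the estimate used for the new code
    return rebuilds

# Hypothetical stream: the estimate grows 1, 2, 3, ..., 1_000_000.
print(count_rebuilds(range(1, 1_000_001)))
```

Under these assumptions a million updates trigger only 19 rebuilds, which is the logarithmic behaviour the trick relies on.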

4. Why This Matters

Space: It shrinks the memory usage to match the theoretical limit up to a constant factor. It's like getting the same notebook down to pocket size without tearing out a single page.
Speed: It updates almost instantly. The "rebuilding" of the code is so rare that, on average, every update is still lightning fast.
Merging: If you have two of these sketches (one from a server in New York and one in London), you can smash them together to get a global count. The new method keeps this "mergeability" feature, which is crucial for big data systems.
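Merging two sketches boils down to keeping, for each register, the larger of the two values. A minimal sketch on plain (uncompressed) register lists follows; the register values and the function name `merge_registers` are made up for illustration, and a compressed sketch would first have to decode its registers.

```python
def merge_registers(a, b):
    """Merge two uncompressed register arrays of equal length."""
    if len(a) != len(b):
        raise ValueError("sketches must use the same number of registers")
    # Each register keeps the largest rank seen by either sketch.
    return [max(x, y) for x, y in zip(a, b)]

new_york = [3, 0, 5, 2]   # hypothetical register values
london = [1, 4, 5, 0]
merged = merge_registers(new_york, london)
print(merged)
```

Because the maximum is commutative and idempotent, the order of merging does not matter and merging a sketch with itself changes nothing.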

The Bottom Line

The Huffman-Bucket Sketch is like taking a standard, slightly bulky digital notebook and compressing it with the smartest possible zipper. It organizes the data into small, manageable chunks, uses a dynamic secret code that adapts as the data grows, and allows you to merge different notebooks together seamlessly.

It proves that you don't have to sacrifice speed or accuracy to save space. You can have a tiny, efficient sketch that is just as powerful as the big, clunky ones we've been using for years.

Here is a detailed technical summary of the paper "Huffman-Bucket Sketch: A Simple $O(m)$ Algorithm for Cardinality Estimation" by Matti Karppa.

1. Problem Statement

The paper addresses the cardinality estimation problem: estimating the number of distinct elements ( $n$ ) in a massive data stream using limited memory.

Context: The standard solution is the HyperLogLog (HLL) sketch. HLL uses $O(m \log \log n)$ bits for $m$ registers to achieve a relative standard error of $O(1/\sqrt{m})$ . It is highly valued for its constant-time updates and mergeability (the ability to combine sketches from different streams).
Limitation: While HLL is efficient, its space complexity is not optimal. The information-theoretic lower bound for the same error rate is $\Omega(m + \log n)$ bits, a factor of roughly $\log \log n$ below what HLL uses. Previous attempts to compress HLL often sacrificed mergeability or constant-time updates.
Goal: Develop a data structure that losslessly compresses an HLL sketch to the optimal $O(m + \log n)$ bits while preserving mergeability and maintaining efficient (amortized constant-time) updates.
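For reference, the uncompressed HLL register update that HBS sets out to compress can be sketched as follows. This is a simplified illustration under stated assumptions: a 64-bit hash taken from SHA-256, $m = 2^p$ registers, and the hypothetical helper name `hll_update`. Production HLL implementations differ in hash choice and bit layout.

```python
import hashlib

def hll_update(registers, item, p):
    """Update a plain (uncompressed) HLL register array in place.

    registers: list of m = 2**p small integers.
    The top p bits of a 64-bit hash pick the register; the rank is the
    position of the first 1-bit in the remaining 64 - p bits.
    """
    digest = hashlib.sha256(str(item).encode()).digest()
    h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
    index = h >> (64 - p)                      # top p bits
    rest = h & ((1 << (64 - p)) - 1)           # remaining 64 - p bits
    rank = (64 - p) - rest.bit_length() + 1    # leading zeros + 1
    if rank > registers[index]:
        registers[index] = rank                # registers only ever grow
    return index, rank

p = 4
registers = [0] * (1 << p)
for i in range(1000):
    hll_update(registers, f"book-{i}", p)
print(registers)
```

Each register only ever stores a small rank, which is why a fixed-width register wastes bits and why re-inserting an item that was already seen leaves the sketch unchanged.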

2. Methodology: The Huffman-Bucket Sketch (HBS)

The core innovation is the Huffman-Bucket Sketch (HBS), which leverages the statistical properties of HLL register values to apply lossless compression.

Key Concepts

Register Distribution Concentration: In an HLL sketch, the values (ranks) stored in registers are not uniformly distributed. They are highly concentrated around the mode $r^* = \lceil \log_2(n/m) \rceil$ , with rapidly decaying tails. The entropy of these values is asymptotically constant per register.
Bucketing: The $m$ registers are partitioned into $m/B$ buckets, each containing $B$ registers. The bucket size $B$ is chosen to be $O(\log n)$ (e.g., fitting within a machine word or cache line).
Global Huffman Coding:
- Instead of storing raw register values, the algorithm encodes them using a Huffman code.
- A global Huffman tree (or codebook) is constructed based on the probability distribution of register ranks.
- Crucially, this distribution is determined solely by the cardinality estimate $\hat{n}$ and the number of registers $m$ .
Dynamic Tree Reconstruction:
- Since the true cardinality $n$ is unknown, the algorithm uses the current global cardinality estimate $\hat{n}$ to approximate the distribution and build the Huffman tree.
- The tree is reconstructed only $O(\log n)$ times over the entire stream. This occurs roughly when the cardinality doubles, shifting the mode of the distribution significantly enough to change the optimal Huffman tree structure.
Data Structure Components:
- Bucket Array: Stores variable-length Huffman codewords for registers in each bucket.
- Unary Length Array: Stores the lengths of codewords within a bucket to allow fast locating of specific registers.
- Metadata: Stores the minimum rank ( $r_{min}$ ) and count of minimum ranks ( $c_{min}$ ) per bucket (to handle small cardinalities via linear counting fallback) and global cardinality estimates ( $\hat{n}, \hat{n}_{old}$ ).
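The claim that the codebook depends only on $\hat{n}$ and $m$ can be illustrated numerically. The sketch below assumes idealised hashing, under which $P(\text{register} \le r) = (1 - 2^{-r}/m)^n$, truncates ranks at 64, and builds a textbook Huffman code over the resulting distribution. The function names and the example values $\hat{n} = 10^6$, $m = 1024$ are assumptions for illustration, not the paper's construction.

```python
import heapq
import math
from itertools import count

def rank_distribution(n, m, max_rank=64):
    """P(register == r) for r = 0..max_rank under idealised hashing."""
    # Each of n items hits a given register with probability 1/m and
    # has rank > r with probability 2**-r.
    cdf = [(1.0 - 2.0 ** -r / m) ** n for r in range(max_rank)] + [1.0]
    pmf = [cdf[0]]
    pmf += [max(0.0, cdf[r] - cdf[r - 1]) for r in range(1, max_rank + 1)]
    return pmf

def huffman_code_lengths(weights):
    """Return {rank: codeword length} for a Huffman code over ranks."""
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(w, next(tie), {sym: 0}) for sym, w in enumerate(weights)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

n_hat, m = 1_000_000, 1024   # hypothetical estimate and register count
pmf = rank_distribution(n_hat, m)
lengths = huffman_code_lengths(pmf)
mode = max(range(len(pmf)), key=pmf.__getitem__)
entropy = -sum(p * math.log2(p) for p in pmf if p > 0)
avg_bits = sum(p * lengths[r] for r, p in enumerate(pmf))
print(mode, round(entropy, 2), round(avg_bits, 2))
```

In this model the mode lands near $\lceil \log_2(n/m) \rceil$, and the expected codeword length stays within one bit of the entropy, well below a fixed-width layout (6 bits per register is a common choice in HLL implementations).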

Operations

Insert: Hashes an element to a bucket and register. If the new rank is higher than the current one, it updates the register. If the update changes the bucket's minimum rank, it triggers a local recalculation. If the global estimate $\hat{n}$ changes significantly from $\hat{n}_{old}$ , the Huffman tree is rebuilt, and all buckets are re-encoded.
Peek/Poke: Accesses or modifies a specific register. Assuming a bucket occupies $O(\log n)$ bits, these take $O(1)$ or $O(\log n)$ time depending on implementation details (lookup tables vs. tree traversal).
Merge: Decodes registers from two sketches, takes the element-wise maximum, and re-encodes them using a new Huffman tree (or reuses the larger sketch's tree if estimates are similar).
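One way to picture how a Peek can find a variable-length codeword inside a bucket is shown below. The layout is an assumption made for illustration: each codeword length $\ell$ is written in unary as $\ell - 1$ zeros followed by a one, so the unary string has exactly as many bits as the packed codewords and each 1-bit marks where a codeword ends. Bits are modelled as Python strings for readability, and `encode_bucket` and `peek` are hypothetical names.

```python
def encode_bucket(codewords):
    """Pack codewords (bit strings) and a matching unary length string."""
    bits = "".join(codewords)
    # Length L becomes L-1 zeros and a terminating one.
    unary = "".join("0" * (len(c) - 1) + "1" for c in codewords)
    return bits, unary

def peek(bits, unary, i):
    """Return the i-th codeword of a bucket (0-based)."""
    start = 0
    for _ in range(i):                        # skip the first i codewords
        start = unary.index("1", start) + 1
    end = unary.index("1", start) + 1         # the next 1-bit ends codeword i
    return bits[start:end]

codewords = ["0", "10", "0", "110", "0"]      # made-up prefix-free codewords
bits, unary = encode_bucket(codewords)
print(bits, unary, peek(bits, unary, 3))
```

A word-level implementation would replace the loop with bit-counting instructions on a machine word, which is where the small bucket size pays off.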

3. Key Contributions

Optimal Space Complexity: The HBS achieves a size of $O(m + \log n)$ bits, which is asymptotically optimal for cardinality estimation at this error rate. This is a reduction by a factor of about $\log \log n$ from HLL's $O(m \log \log n)$ .
Preserved Mergeability: Unlike many compression techniques that break the ability to merge sketches, HBS is a drop-in replacement for HLL. It can be merged with other HBS sketches or decompressed back to a standard HLL sketch at any time.
Amortized Constant-Time Updates:
- While rebuilding the Huffman tree is expensive ( $O(m \log n)$ ), it happens infrequently ( $O(\log n)$ times over $n$ insertions).
- The paper proves that the amortized cost per insertion is $O(1)$ , provided $m$ is within a reasonable range relative to $n$ (specifically $m = O(n / \log^3 n)$ in the general case, or $O(n / \log^2 n)$ with lookup tables).
Theoretical Analysis: The paper provides rigorous proofs regarding:
- The unimodality and tail behavior of the register rank distribution.
- The bound on Huffman tree reconstruction frequency ( $O(\log n)$ ).
- The high-probability bound on the total bit size of a bucket ( $O(\log n)$ ).
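The amortized bound can be sanity-checked with a back-of-the-envelope calculation. Taking the $O(m \log n)$ rebuild cost and the $O(\log n)$ rebuild count quoted above at face value, and spreading the total over $n$ insertions:

$$
\frac{\overbrace{O(\log n)}^{\text{rebuilds}} \cdot \overbrace{O(m \log n)}^{\text{cost per rebuild}}}{n}
= O\!\left(\frac{m \log^2 n}{n}\right)
= O(1) \quad \text{when } m = O\!\left(\frac{n}{\log^2 n}\right).
$$

This matches the lookup-table condition stated above. The stricter general-case condition $m = O(n / \log^3 n)$ would be consistent with one additional logarithmic factor in the cost, though that reading is an inference from the stated bounds rather than something this summary spells out.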

4. Results and Analysis

Space Efficiency: Numerical evidence suggests that for practical bit budgets (e.g., 64 to 1024 bits per bucket), the Memory-Variance Product (MVP) of HBS is competitive with state-of-the-art sketches like ExaLogLog, even without utilizing the extra information from the FM85 matrix that ExaLogLog uses.
Tree Stability: The analysis shows that the Huffman tree structure is "rigid" in the tails. Changes only occur in a constant-width band around the mode. Since the mode shifts logarithmically with $n$ , the tree changes only $O(\log n)$ times.
Practicality: The algorithm is designed for real-world implementation, with bucket sizes chosen around cache lines and machine words. The authors' evidence that HBS is competitive with existing solutions in both space and Memory-Variance Product is preliminary.

5. Significance

Bridging Theory and Practice: The paper bridges the gap between the theoretical lower bound of $O(m + \log n)$ bits and practical, mergeable sketches.
Drop-in Replacement: It offers a path to reduce memory usage in distributed systems (databases, networking, metagenomics) without sacrificing the critical property of mergeability, which is essential for distributed cardinality estimation.
Generalizability: The framework is not limited to HLL; the authors suggest it can be extended to other sketches with concentrated distributions (e.g., Count-Min Sketch, though with more complexity) or other rank functions (e.g., base- $b$ representations).

In summary, the Huffman-Bucket Sketch is a novel, theoretically optimal, and practically viable data structure that compresses HyperLogLog sketches by exploiting the concentrated nature of register rank distributions, achieving significant space savings while maintaining the mergeability and efficiency required for large-scale data stream processing.
