Zero-Cost NDV Estimation from Columnar File Metadata

Imagine you are the manager of a massive, high-speed library (a database) where books are stored in huge, sealed boxes called Row Groups. You need to plan a big event (a query) and need to know: "How many unique titles are in this entire collection?"

Usually, to find the answer, you'd have to open every single box, read every book, and count them. That takes forever and wastes energy.

This paper presents a clever trick: You can guess the number of unique titles just by looking at the labels on the outside of the boxes, without ever opening them.

Here is how the author, Claude Brisson, explains this "Zero-Cost" magic using two main clues found on the box labels.

Clue #1: The "Backpack Weight" Trick (Dictionary Inversion)

Imagine that inside each box, the books aren't just stacked randomly. Instead, the librarian has made a dictionary (a master list of all unique titles) and replaced every book with a tiny number (an index) pointing to that list.

The Logic: If you know the total weight of the box (the file size) and the average weight of a single book, you can mathematically reverse-engineer how many unique titles must be in the dictionary to make that weight add up.
The Catch: This works best when every box holds a random mix of titles drawn from the whole collection; in that case the weight calculation is very accurate.
The Metaphor: It's like weighing a bag of mixed nuts. If you know the average weight of a peanut and the total weight of the bag, you can guess how many peanuts are inside. But if the bag only contains peanuts and no cashews, your guess about the "total variety" of nuts in the whole store might be wrong.

Clue #2: The "Extreme Weather" Trick (Min/Max Diversity)

Now, imagine the librarian also writes down the coldest and hottest temperature recorded in each box on the outside label.

The Logic: If the library is organized by season (sorted data), Box 1 might have "Winter" temperatures, Box 2 "Spring," and so on. By counting how many different "coldest" and "hottest" labels appear across all boxes, you can guess how many unique seasons (or values) exist in the whole library.
The Math: The paper uses a famous math problem called the Coupon Collector Problem. It's like asking: "If I collect one coupon from every box, how many total types of coupons are there in the whole world?"
The Catch: This works great if the boxes are sorted (like a timeline). But if the boxes are a random mix, the "coldest" and "hottest" labels might all look the same, making you think there are fewer unique values than there really are.

The "Smart Switch" (Distribution Detector)

The author realized that neither trick works perfectly all the time.

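Trick #1 fails if the data is sorted (because each box then holds only a small slice of the titles, so its "weight" says little about the whole collection and the guess comes out too low).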
Trick #2 fails if the data is mixed up (because the "extremes" look too similar).

So, the paper introduces a Traffic Cop. Before making a guess, the system looks at the labels to see: "Are the boxes sorted like a timeline, or are they a random mix?"

If it's a random mix, it trusts the Weight Trick.
If it's sorted, it trusts the Extreme Weather Trick.
If it's a mix of both, it takes the higher of the two guesses to be safe.

Why Does This Matter?

In the world of big data (like the GPU engines mentioned in the paper), knowing the number of unique items helps the computer decide:

How much memory to grab: Don't grab a truck if you only need a bicycle.
How to join tables: If you know there are only 5 unique customer IDs, you can process them instantly. If there are 5 million, you need a different strategy.

The "Zero-Cost" Promise

The most exciting part is that this requires no extra work.

No extra storage: You aren't saving a new file.
No data access: You aren't opening the boxes.
No waiting: You just read the tiny metadata labels that are already there.

The Tragic Twist

The paper ends with a sad note: The author built this system at a company called VoltronData, and it worked beautifully in the real world. However, when the company shut down and its assets were sold off, the actual code and test results were lost. This paper is the author's attempt to rebuild the invention from memory, proving that the math still holds up even without the original data.

In short: This paper teaches us how to guess the size of a crowd just by looking at the shadows cast by the people, without ever needing to count the people themselves.

1. Problem Statement

In columnar file formats like Apache Parquet, the number of distinct values (NDV), or cardinality, of a column is a critical input to cost-based query optimization (e.g., join ordering, aggregate pushdown, and GPU memory allocation). However, existing metadata rarely contains accurate NDV counts because:

Computing exact distinct counts is computationally expensive.
Most data writers omit the distinct_count field to avoid this overhead.
Alternative methods like sampling or maintaining HyperLogLog sketches require data access or additional writer-side infrastructure, violating the goal of metadata-only planning.

The paper addresses the challenge of estimating NDV without accessing data pages and without extra storage, using only existing file metadata.

2. Methodology

The proposed approach exploits two complementary signals implicitly encoded in Parquet metadata: Dictionary-encoded storage size and Row Group Min/Max statistics. A lightweight distribution detector routes between these two estimators based on the data layout.

A. Dictionary Size Inversion (For "Well-Spread" Data)

Premise: In dictionary encoding, the total uncompressed size ($S$) of a column chunk is a function of the number of distinct values ($ndv$), the mean value length ($len$), the row count ($N$), and the null count ($nulls$).
Equation: $S = ndv \times len + (N - nulls) \times \lceil \log_2(ndv) \rceil / 8$.
Solution: The method inverts this equation to solve for $ndv$. Since the equation is non-linear due to the logarithmic term, it uses the Newton-Raphson method for rapid convergence (typically 5–10 iterations).
Parameter Estimation: The mean value length ($len$) is estimated by averaging the byte lengths of all distinct min/max values observed across row groups.
Fallback: If the estimated $ndv$ approaches the total row count (indicating plain encoding rather than dictionary encoding), the estimate is treated as a lower bound.
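
The inversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the starting guess, and the tolerance are assumptions, and the ceiling in the equation is replaced by a continuous $\log_2$ so that the function is differentiable for Newton-Raphson.

```python
import math


def mean_value_length(extrema):
    """Average byte length of the distinct min/max values seen across row groups."""
    distinct = set(extrema)
    if not distinct:
        return 0.0
    return sum(len(v) for v in distinct) / len(distinct)


def invert_dictionary_size(total_size, mean_len, num_rows, null_count=0,
                           tol=1e-6, max_iter=50):
    """Solve S = ndv*len + (N - nulls)*log2(ndv)/8 for ndv with Newton-Raphson.

    Assumption: the ceiling around log2(ndv) is dropped (continuous relaxation).
    """
    rows = num_rows - null_count
    if rows <= 0 or total_size <= 0 or mean_len <= 0:
        return 0.0
    # Starting guess (an assumption): pretend the index bytes are free.
    ndv = max(1.0, min(total_size / mean_len, float(rows)))
    for _ in range(max_iter):
        f = ndv * mean_len + rows * math.log2(ndv) / 8.0 - total_size
        df = mean_len + rows / (8.0 * ndv * math.log(2.0))
        step = f / df
        ndv = max(1.0, ndv - step)  # keep log2 defined and ndv >= 1
        if abs(step) < tol * max(1.0, ndv):
            break
    # NDV can never exceed the number of non-null rows.
    return min(ndv, float(rows))


# Usage: a column chunk of one million rows whose distinct values average 12 bytes.
size = 1000 * 12 + 1_000_000 * math.log2(1000) / 8
estimate = invert_dictionary_size(size, mean_len=12, num_rows=1_000_000)
```

Because the continuous relaxation ignores the rounding of index widths, an estimate obtained from a real chunk size should be read as approximate. Per the fallback rule above, an estimate close to the non-null row count is better treated as a lower bound.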

B. Min/Max Diversity Estimation (For "Sorted/Partitioned" Data)

Premise: Row groups store min/max statistics. For sorted or partitioned data, these extrema act as implicit samples of the global value distribution.
Model: The method models the collection of distinct min (or max) values across $n$ row groups as a Coupon Collector problem.
Equation: $E[m] = NDV \times (1 - e^{-n/NDV})$, where $m$ is the count of distinct min/max values observed.
Solution: This equation is inverted (via Newton-Raphson) to estimate the global $NDV$ based on the observed diversity of extrema.
Strength: This method excels when data is sorted or partitioned, where dictionary inversion tends to underestimate cardinality because distinct values are not well-spread across row groups.
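
A minimal Python sketch of the Coupon Collector inversion follows. The function name, the starting point, and the handling of the degenerate case are assumptions made for illustration. When every observed extremum is distinct ($m \geq n$), the equation has no finite solution, so the sketch returns infinity and leaves the bounding to the caller.

```python
import math


def invert_coupon_collector(m, n, tol=1e-9, max_iter=100):
    """Solve m = NDV * (1 - exp(-n / NDV)) for NDV with Newton-Raphson.

    m: number of distinct min (or max) values observed
    n: number of draws, here the number of row groups
    """
    if m <= 0 or n <= 0:
        return 0.0
    if m >= n:
        # All draws distinct: the model only says NDV is very large.
        return math.inf
    x = float(m)  # the solution is always above m, so start there
    for _ in range(max_iter):
        t = n / x
        g = -x * math.expm1(-t) - m           # x * (1 - e^{-t}) - m
        dg = 1.0 - (1.0 + t) * math.exp(-t)   # derivative of g with respect to x
        step = g / dg
        x -= step
        if abs(step) < tol * x:
            break
    return x


# Usage: 200 row groups showing 165 distinct minimum values.
ndv_estimate = invert_coupon_collector(165, 200)
```

The function being inverted is increasing and concave in NDV, so starting at $m$ the iterates approach the root from below. Convergence slows as $m$ approaches $n$, which is also where the estimate becomes least reliable.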

C. Distribution Detection & Hybridization

Routing: A detector analyzes row group ranges to classify data layout:
- Well-spread: High overlap between row group ranges $\rightarrow$ Use Dictionary Inversion.
- Sorted/Pseudo-sorted: Low overlap, high monotonicity $\rightarrow$ Use Min/Max Diversity.
- Mixed: Use both estimates conservatively.
Final Estimate: The system takes the maximum of the two estimates, bounded by the number of non-null rows and type-specific constraints (e.g., integer range limits).
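
The routing step can be sketched as below. The source does not give the detector's exact metrics or thresholds, so everything here is an assumption for illustration: overlap is measured as the fraction of adjacent row groups whose ranges strictly intersect, monotonicity as the fraction of adjacent row groups whose minimums move in one direction, and the cutoffs 0.2 and 0.8 are hypothetical.

```python
def classify_layout(ranges, low=0.2, high=0.8):
    """Classify a column as 'well_spread', 'sorted', or 'mixed'.

    ranges: list of (min, max) per row group, in file order.
    low/high: hypothetical thresholds, not taken from the paper.
    """
    pairs = list(zip(ranges, ranges[1:]))
    if not pairs:
        return "mixed"
    # Strict comparison so that sorted groups sharing a boundary value
    # are not counted as overlapping.
    overlap = sum(
        1 for (lo_a, hi_a), (lo_b, hi_b) in pairs if lo_b < hi_a and lo_a < hi_b
    ) / len(pairs)
    rising = sum(1 for a, b in pairs if b[0] >= a[0]) / len(pairs)
    falling = sum(1 for a, b in pairs if b[0] <= a[0]) / len(pairs)
    monotonicity = max(rising, falling)
    if overlap >= high:
        return "well_spread"
    if overlap <= low and monotonicity >= high:
        return "sorted"
    return "mixed"


def combine_estimates(layout, dict_ndv, minmax_ndv, non_null_rows, type_max=None):
    """Pick an estimate by layout, then clamp it to the known upper bounds."""
    if layout == "well_spread":
        estimate = dict_ndv
    elif layout == "sorted":
        estimate = minmax_ndv
    else:
        estimate = max(dict_ndv, minmax_ndv)
    estimate = min(estimate, non_null_rows)
    if type_max is not None:
        # For example, an 8-bit integer column has at most 256 distinct values.
        estimate = min(estimate, type_max)
    return estimate


# Usage: three disjoint, ascending row groups route to the min/max estimator.
layout = classify_layout([(0, 9), (10, 19), (20, 29)])
final = combine_estimates(layout, dict_ndv=12, minmax_ndv=30, non_null_rows=3000)
```

Clamping after selection means an unbounded min/max estimate (all extrema distinct) still yields a finite answer, capped by the non-null row count or the type's value range.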

3. Key Contributions

Dictionary-Based NDV Estimation: A novel method to derive NDV by numerically inverting the dictionary storage size equation using Newton-Raphson iteration.
Implicit Cardinality Sketches: The recognition that row group min/max statistics function as implicit sketches, recoverable via Coupon Collector inversion.
Adaptive Routing: A lightweight detector that classifies data distribution (sorted vs. well-spread) to select the most accurate estimator.
Batch Memory Prediction: An extension of the Coupon Collector model to predict GPU dictionary memory requirements for specific batch sizes without reading data.
Zero-Cost Implementation: The technique requires only metadata parsing ($O(n)$ time, $O(1)$ space), making it suitable for high-performance query engines.
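
The summary does not spell out how the batch memory prediction works, so the sketch below is only one plausible reading. It assumes that rows in a batch draw values uniformly from the column's distinct values, that the expected number of distinct values in a batch follows the same Coupon Collector form used for the min/max estimator, and that the dictionary costs the mean value length per entry. The function name and parameters are hypothetical.

```python
import math


def predict_batch_dictionary_bytes(ndv, mean_len, batch_rows):
    """Expected dictionary size in bytes for a batch of `batch_rows` rows.

    Assumption: expected distinct values in the batch
    = ndv * (1 - exp(-batch_rows / ndv)), each costing `mean_len` bytes.
    """
    if ndv <= 0 or batch_rows <= 0:
        return 0.0
    expected_distinct = -ndv * math.expm1(-batch_rows / ndv)
    return expected_distinct * mean_len


# Usage: a column with about 1000 distinct 8-byte values, read in 1000-row batches.
batch_bytes = predict_batch_dictionary_bytes(ndv=1000, mean_len=8, batch_rows=1000)
```

Under these assumptions a small batch needs far less dictionary memory than the full column, while a very large batch approaches the full dictionary size of NDV times the mean value length.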

4. Results and Evaluation

Deployment: The technique was implemented in Theseus, a GPU-accelerated distributed query engine at VoltronData.
Accuracy:
- On well-spread production datasets, errors were typically below 10%.
- On sorted datasets, the hybrid approach (combining both estimators) effectively corrected the systematic underestimation caused by dictionary inversion alone.
Performance: All operations are single-pass over metadata, ensuring negligible overhead during query planning.
Limitations: The original implementation and experimental data were lost due to the liquidation of VoltronData; the paper reconstructs the approach from memory, with plans for future public benchmark reproduction.

5. Significance

Query Optimization: Enables accurate cost-based decisions (e.g., join ordering, aggregate pushdown) in distributed systems where data scanning is expensive.
Resource Management: Crucial for GPU memory allocation, allowing engines to pre-allocate dictionary buffers efficiently without over-provisioning.
Format Agnosticism: While demonstrated on Parquet, the technique generalizes to any columnar format (e.g., ORC, F3) that supports dictionary encoding and partition-level statistics.
Zero-Overhead: It eliminates the need for data access or writer-side infrastructure changes to obtain cardinality statistics, bridging a critical gap in metadata-driven query optimization.