Weighted Reservoir Sampling With Replacement from Data Streams

Imagine you are running a massive, never-ending parade. Thousands of people (data) are marching past you every second, each carrying a sign with a number on it (a "weight"). Some people are famous celebrities (high weight), while others are just regular folks (low weight).

Your job is to keep a small, fixed-size "snapshot" of this parade in a photo album (the Reservoir) that always represents the crowd you've seen so far. The catch? You need to pick people for your album proportionally to their fame. A celebrity should appear in your album more often than a regular person.

The Problem: The Old Ways

For a long time, computer scientists had two main ways to do this, but both had flaws:

The "No-Repeat" Rule (Without Replacement): Most existing methods said, "Once you put a person in the album, you can't put them in again." This is great for saving space, but it breaks the rules for certain statistical tricks (like the "Bootstrap" method) that require you to be able to pick the same person multiple times to get an accurate picture.
The "Slow & Steady" Rule (With Replacement): There were methods that allowed repeats, but they were incredibly slow. Imagine checking every single person in the parade one by one to decide if they belong in your album. If the parade has 10 million people, your computer gets tired and slows down to a crawl.

The New Solution: WRSWR-SKIP

The authors of this paper, Adriano Meligrana and Adriano Fazzone, invented a new method called WRSWR-SKIP. Think of it as a sampler that leaps ahead through the parade instead of stopping at every marcher.

Here is how it works, using a simple analogy:

1. The "Skip" Mechanism (The Magic Rope)

Imagine you have a rope that gets longer as more people join the parade.

The Old Way: You stop and hold the rope up to every single person to see whether it is long enough to catch them.
The New Way (WRSWR-SKIP): You don't check everyone! You calculate a "skip distance" in advance. Say you work out that the rope won't be long enough to catch anyone among the next 500 people. You skip right over them and only stop at the person the rope finally reaches.

This is the "Skip" in the name. It lets the computer pass over long stretches of the parade, doing nothing more than adding each person's weight to a running total, and stop only at the person who actually earns a place in the album.
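To make the rope picture concrete, here is a minimal Python sketch of a skip threshold. It rests on an assumption that is not quoted from the paper: each of the `m` album slots is replaced by a newcomer with probability (their weight) / (total weight so far), which implies the threshold formula in the docstring. The function names `next_skip_threshold` and `count_skipped` are made up for this illustration.

```python
import random

def next_skip_threshold(total_weight, m, rng):
    """Cumulative weight at which the next album update happens.

    Assumption: no slot changes while the total grows from W to W' with
    probability (W / W')**m, so inverting with a uniform draw U gives
    W' = W / U**(1/m).
    """
    u = 1.0 - rng.random()          # uniform in (0, 1], so u is never zero
    return total_weight / u ** (1.0 / m)

def count_skipped(weights, start_weight, m, rng):
    """Count how many upcoming people are passed over before the next update."""
    threshold = next_skip_threshold(start_weight, m, rng)
    total = start_weight
    skipped = 0
    for w in weights:
        total += w                  # the only work done for a skipped person
        if total >= threshold:
            break                   # this person triggers an update
        skipped += 1
    return skipped

rng = random.Random(0)
# A 10-slot album after 1,000,000 units of weight, followed by unit-weight marchers.
print(count_skipped([1.0] * 1_000_000, 1_000_000.0, 10, rng))
```

The later the parade, the larger the total weight and the longer the typical skip, which is where the savings come from.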

2. The "With Replacement" Twist

When the algorithm finally stops at a person (because the rope caught them), it doesn't just add them to the album. It asks: "How many copies of this person should we add?"

If the person is very famous (high weight), the algorithm might say, "Add 5 copies of them!"
It then randomly picks 5 different spots in the album and replaces the photos there with this new celebrity.
Because it can add multiple copies, it perfectly mimics the "With Replacement" requirement, which is crucial for advanced math and statistics.
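The "how many copies?" step can be sketched in Python as follows. The number of copies is drawn from a binomial distribution conditioned on being at least one, and the copies then overwrite that many distinct album slots. The helper names `truncated_binomial` and `place_copies` are illustrative, and the simple inverse-CDF sampler shown here is only one possible choice, suitable for small albums.

```python
import random
from math import comb

def truncated_binomial(m, p, rng):
    """Draw k from Binomial(m, p) conditioned on k >= 1 (inverse-CDF method)."""
    cumulative = []
    running = 0.0
    for k in range(1, m + 1):
        running += comb(m, k) * p**k * (1.0 - p) ** (m - k)
        cumulative.append(running)
    target = (1.0 - rng.random()) * running     # uniform in (0, running]
    for k, c in enumerate(cumulative, start=1):
        if c >= target:
            return k
    return m    # not reached: the last cumulative value equals `running`

def place_copies(album, person, p, rng):
    """Overwrite k distinct, randomly chosen slots with `person`; return k."""
    k = truncated_binomial(len(album), p, rng)
    for slot in rng.sample(range(len(album)), k):
        album[slot] = person
    return k

rng = random.Random(42)
album = ["fan"] * 8
copies = place_copies(album, "celebrity", 0.6, rng)
print(copies, album)
```

Here `p` stands for the newcomer's share of the total weight seen so far, so a heavier newcomer tends to receive more copies.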

Why is this a Big Deal?

1. It's Instantly Ready (The "Get" Operation)
In the old methods, if you wanted to see your album, you often had to do extra work (post-processing) to shuffle the photos around or fix the math.

WRSWR-SKIP: The album is always perfectly organized. You can grab it and show it to anyone at any second. It's like having a photo album that magically rearranges itself the moment a new photo is taken.

2. It's Blazing Fast (The "Add" Operation)
The authors tested this against the best existing methods.

The Result: Their method consistently beat one with-replacement competitor and kept pace with another on small albums, then pulled clearly ahead as the album got bigger. It stays fast because it keeps skipping the boring parts.
The Analogy: Imagine reading a 1,000-page book. The old methods read every single word. WRSWR-SKIP reads the first page, realizes the next 900 pages are just filler, skips them instantly, and only reads the important chapters.

The Real-World Test

The authors didn't just talk about theory; they tested it on real data (34 million Wikipedia clicks).

Scenario: Imagine tracking which Wikipedia articles are being clicked. Some articles (like "Cat") get millions of clicks (high weight), while obscure ones get one click.
Outcome: Their new method processed these clicks faster than the competition and could hand back an up-to-date weighted sample of articles at any moment, with no extra work.

Summary

This paper introduces a smart, skipping algorithm that lets computers take a weighted snapshot of a never-ending stream of data. It's like having a camera that can instantly zoom past millions of unimportant people to capture the celebrities, while ensuring the final photo album is mathematically perfect and ready to use immediately. It solves a problem that has been stuck in "slow motion" for years, making it perfect for real-time data analysis.

Here is a detailed technical summary of the paper "Weighted Reservoir Sampling With Replacement from Data Streams" by Meligrana and Fazzone.

1. Problem Definition

The paper addresses the challenge of Weighted Reservoir Sampling with Replacement (WRSWR) in the context of data streams.

Context: A data stream is a sequence of items arriving one at a time, with the total number of items ($N$) unknown in advance. Each item $e_t$ has an associated weight $w_t$.
Goal: Maintain a fixed-size reservoir ($\mathcal{R}$) of size $m$ such that, at any point in time, each reservoir slot holds item $e_i$ with probability equal to $w_i$ divided by the total weight of all items seen so far.
Constraint: The sampling must be with replacement, meaning the same element can appear multiple times in the sample. This is crucial for statistical tasks requiring element independence (e.g., weighted bootstrap), which sampling without replacement cannot provide directly.
Gap: Existing literature heavily favors without-replacement sampling. While methods for with-replacement exist (e.g., Chaudhuri et al., Park et al.), they often suffer from suboptimal performance due to a lack of "weight skipping" techniques, leading to high computational costs when processing large streams.
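For reference, the problem can be solved without any skipping by flipping one coin per slot for every arriving item. The Python sketch below is a simple unoptimized baseline written for illustration; it is not the code of any of the cited methods, the name `naive_wrswr` is hypothetical, and it assumes strictly positive weights.

```python
import random

def naive_wrswr(stream, m, rng):
    """Unoptimized baseline: one random draw per slot for every arriving item."""
    reservoir = []
    total = 0.0
    for item, w in stream:
        total += w
        if not reservoir:
            reservoir = [item] * m      # the first item fills every slot
            continue
        p = w / total                   # chance that this item takes a given slot
        for slot in range(m):
            if rng.random() < p:
                reservoir[slot] = item
    return reservoir

# With weights 1 and 3, roughly three quarters of all slots should hold "b".
stream = [("a", 1.0), ("b", 3.0)]
slots = []
for seed in range(2000):
    slots.extend(naive_wrswr(stream, 10, random.Random(seed)))
print(slots.count("b") / len(slots))
```

The cost is easy to see: processing $N$ items draws about $m \cdot N$ random numbers, which is what a skip technique aims to avoid.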

2. Methodology: WRSWR-SKIP

The authors propose WRSWR-SKIP, a novel one-pass algorithm designed to efficiently handle weighted streams with replacement.

Core Algorithm Logic

The algorithm maintains a reservoir $\mathcal{R}$ and a cumulative weight $W$. It utilizes a skip mechanism to avoid processing every single item individually, thereby reducing the number of random variates required.

Initialization: The reservoir is initialized with $m$ copies of the first stream element. A skip threshold $W_{skip}$ is generated based on the current cumulative weight and a random uniform variable.
Streaming Loop: For each new item $(e_t, w_t)$:
- Update the cumulative weight $W \leftarrow W + w_t$.
- Skip Check: If $W < W_{skip}$, the algorithm skips the item entirely (no update to the reservoir). The threshold is drawn so that a skipped run of items has the same probability as that run leaving the reservoir unchanged under item-by-item processing.
- Update Trigger: If $W \geq W_{skip}$, the algorithm performs an update:
  - Generate a new $W_{skip}$ for the next potential update.
  - Determine the number of copies $k$ of the current item $e_t$ to insert into the reservoir. $k$ is drawn from a truncated Binomial distribution $B_{>0}(m, w_t/W)$. The truncation ensures $k \geq 1$ (at least one replacement occurs).
  - Insert $e_t$ into $k$ distinct, randomly chosen positions in the reservoir.
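The steps above can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' Julia implementation. The class name `SkipReservoir` and its methods are hypothetical, the threshold formula $W_{skip} = W / U^{1/m}$ is an assumption derived from a per-slot replacement probability of $w_t/W$, and weights are assumed to be strictly positive.

```python
import random
from math import comb

class SkipReservoir:
    """Sketch of weighted reservoir sampling with replacement using skips."""

    def __init__(self, m, seed=None):
        self.m = m
        self.rng = random.Random(seed)
        self.slots = []
        self.total = 0.0        # cumulative weight W
        self.skip_at = 0.0      # threshold W_skip

    def _new_threshold(self):
        # Assumed form: W_skip = W / U**(1/m) with U uniform in (0, 1].
        u = 1.0 - self.rng.random()
        self.skip_at = self.total / u ** (1.0 / self.m)

    def _copies(self, p):
        # Binomial(m, p) conditioned on k >= 1, via a simple inverse CDF.
        cumulative, running = [], 0.0
        for k in range(1, self.m + 1):
            running += comb(self.m, k) * p**k * (1.0 - p) ** (self.m - k)
            cumulative.append(running)
        target = (1.0 - self.rng.random()) * running
        for k, c in enumerate(cumulative, start=1):
            if c >= target:
                return k
        return self.m           # not reached

    def add(self, item, weight):
        self.total += weight
        if not self.slots:                      # first item fills every slot
            self.slots = [item] * self.m
            self._new_threshold()
            return
        if self.total < self.skip_at:           # skip: only W was updated
            return
        k = self._copies(weight / self.total)
        for pos in self.rng.sample(range(self.m), k):   # k distinct positions
            self.slots[pos] = item
        self._new_threshold()

    def get(self):
        return self.slots       # always ready, no post-processing

sampler = SkipReservoir(m=5, seed=1)
for item, weight in [("a", 1.0), ("b", 3.0), ("c", 4.0)]:
    sampler.add(item, weight)
print(sampler.get())
```

Note that `get` returns the live list for brevity; a caller that keeps adding items afterwards would want to copy it first.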

Theoretical Proofs

Correctness: The authors provide a formal proof by induction showing that at any step $N$, the probability of any specific slot in the reservoir containing item $e_i$ is exactly $w_i / W_N$. The skip mechanism is proven to be mathematically equivalent to the unoptimized version where every item is considered, ensuring the statistical properties are preserved.
Efficiency: The expected number of random variates required to process $N$ items is $O(m \log(W_N/w_1))$. This is a significant improvement over naive approaches, which draw random variates for each of the $N$ items.
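The efficiency bound can be made plausible with a short calculation. This is a sketch, not the authors' proof. It assumes that each slot is replaced by item $e_t$ independently with probability $w_t / W_t$, which is consistent with the correctness statement above, and the symbols $U$, $s$, and $t$ are notation introduced here. Under that assumption, the probability that no slot changes while the cumulative weight grows from $W_s$ to $W_t$ telescopes:

$$
\Pr[\text{no update in } (s, t]] = \prod_{j=s+1}^{t} \left( \frac{W_{j-1}}{W_j} \right)^{m} = \left( \frac{W_s}{W_t} \right)^{m}
$$

Setting this equal to a uniform variate $U \in (0, 1]$ and solving for the weight at which the next update occurs gives

$$
W_{skip} = W_s \, U^{-1/m}, \qquad \ln \frac{W_{skip}}{W_s} = -\frac{\ln U}{m} \sim \mathrm{Exp}(m)
$$

Each new threshold therefore advances the log of the cumulative weight by $1/m$ on average, which suggests at most about $m \ln(W_N / w_1)$ updates between the first item and cumulative weight $W_N$, in line with the stated $O(m \log(W_N/w_1))$.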

3. Key Contributions

Novel Algorithm (WRSWR-SKIP): The first specialized, efficient algorithm for weighted reservoir sampling with replacement that incorporates a weight-skipping technique.
Optimal Complexity:
- Add Operation (Update): $O(m \log(W_N/w_1))$ expected random variates. This avoids linear dependence on the stream length $N$.
- Get Operation (Retrieval): $O(1)$. The reservoir is always ready for immediate use without post-processing.
Formal Verification: Rigorous mathematical proofs establishing both the unbiased nature of the sample and the efficiency bounds.
Practical Implementation: The algorithm is implemented in Julia and made publicly available, demonstrating feasibility in real-world scenarios.
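The Add bound listed above can be probed empirically with a small simulation. The Python sketch below counts only how often a threshold is crossed (each crossing costs a handful of random variates) and compares the average with $m \ln(W_N/w_1)$. The threshold formula $W / U^{1/m}$ is an assumption rather than a quotation from the paper, and `count_updates` is a name chosen for this example.

```python
import math
import random

def count_updates(weights, m, rng):
    """Count how many items trigger a reservoir update (threshold crossings)."""
    total = weights[0]                          # the first item fills the reservoir
    skip_at = total / (1.0 - rng.random()) ** (1.0 / m)
    updates = 0
    for w in weights[1:]:
        total += w
        if total >= skip_at:
            updates += 1
            skip_at = total / (1.0 - rng.random()) ** (1.0 / m)
    return updates

m, n = 10, 20_000
weights = [1.0] * n                             # constant-weight synthetic stream
runs = [count_updates(weights, m, random.Random(seed)) for seed in range(30)]
mean_updates = sum(runs) / len(runs)
bound = m * math.log(sum(weights) / weights[0])
print(f"mean updates: {mean_updates:.1f}, m*ln(W_N/w_1): {bound:.1f}")
```

For this stream the average number of updates should be far below the 20,000 items processed, and it should not exceed the logarithmic estimate.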

4. Experimental Results

The authors evaluated WRSWR-SKIP against three baselines: WRSWR (Chaudhuri et al.), WRSWR-BIN (Park et al.), and WRAExp-J (Shekelyan et al., based on Efraimidis-Spirakis).

Datasets:
- Synthetic streams ($N=10^7$) with decreasing, constant, and increasing weight distributions.
- Real-world Wikipedia Clickstream dataset ($N=34$ million items).
Performance Metrics: Average time for Add (processing an item) and Get (retrieving the sample).
Findings:
- Add Performance: WRSWR-SKIP consistently outperforms WRSWR-BIN. It is comparable to WRAExp-J for small reservoir sizes but scales significantly better as $m$ increases. WRAExp-J suffers from $O(\log m)$ update costs due to priority queue operations, whereas WRSWR-SKIP uses constant-time array updates.
- Get Performance: WRSWR-SKIP and WRSWR-BIN achieve constant $O(1)$ retrieval time. In contrast, WRAExp-J exhibits linear $O(m)$ growth in retrieval time because it requires post-processing to convert a without-replacement sample to a with-replacement one.
- Scalability: On the 34M item Wikipedia dataset, WRSWR-SKIP maintained the lowest execution time for updates and provided instant retrieval, confirming its suitability for high-throughput streaming.

5. Significance

This work fills a critical gap in the literature by providing a theoretically sound and practically efficient solution for weighted sampling with replacement in streaming environments.

Statistical Utility: It enables the direct application of statistical methods (like the weighted bootstrap) on streaming data without the computational overhead of transforming samples or performing post-processing.
Efficiency: By combining the skip technique with the with-replacement constraint, it achieves a balance of low update latency and instant sample availability, making it superior to existing state-of-the-art methods for large-scale data stream applications.