Weighted Reservoir Sampling With Replacement from Data Streams

This paper introduces a new one-pass algorithm for weighted reservoir sampling with replacement from data streams of unknown size, which provides a representative sample at any time without post-processing and is proven to be both correct and efficient compared to state-of-the-art methods.

Adriano Meligrana, Adriano Fazzone

Published 2026-03-10
📖 5 min read🧠 Deep dive

Imagine you are running a massive, never-ending parade. Thousands of people (data) are marching past you every second, each carrying a sign with a number on it (a "weight"). Some people are famous celebrities (high weight), while others are just regular folks (low weight).

Your job is to keep a small, fixed-size "snapshot" of this parade in a photo album (the Reservoir) that always represents the crowd you've seen so far. The catch? You need to pick people for your album proportionally to their fame. A celebrity should appear in your album more often than a regular person.

The Problem: The Old Ways

For a long time, computer scientists had two main ways to do this, but both had flaws:

  1. The "No-Repeat" Rule (Without Replacement): Most existing methods said, "Once you put a person in the album, you can't put them in again." This is great for saving space, but it breaks the rules for certain statistical tricks (like the "Bootstrap" method) that require you to be able to pick the same person multiple times to get an accurate picture.
  2. The "Slow & Steady" Rule (With Replacement): There were methods that allowed repeats, but they were incredibly slow. Imagine checking every single person in the parade one by one to decide if they belong in your album. If the parade has 10 million people, your computer gets tired and slows down to a crawl.

The New Solution: WRSWR-SKIP

The authors of this paper, Adriano Meligrana and Adriano Fazzone, invented a new method called WRSWR-SKIP. Think of it as a super-fast, skipping skipper.

Here is how it works, using a simple analogy:

1. The "Skip" Mechanism (The Magic Rope)

Imagine you have a rope that gets longer as more people join the parade.

  • The Old Way: You stop and check every single person against the rope to see if they are long enough to be picked.
  • The New Way (WRSWR-SKIP): You don't check everyone! You calculate a "skip distance." You know that for the next 500 people, the rope isn't long enough to catch any of them. So, you skip right over them! You only stop to check the person when the rope finally gets long enough to catch them.

This is the "Skip" in the name. It allows the computer to ignore millions of people instantly without wasting time checking them, only stopping when a "winner" is mathematically guaranteed to appear.

2. The "With Replacement" Twist

When the algorithm finally stops at a person (because the rope caught them), it doesn't just add them to the album. It asks: "How many copies of this person should we add?"

  • If the person is very famous (high weight), the algorithm might say, "Add 5 copies of them!"
  • It then randomly picks 5 empty spots in the album and swaps the old photos with this new celebrity.
  • Because it can add multiple copies, it perfectly mimics the "With Replacement" requirement, which is crucial for advanced math and statistics.

Why is this a Big Deal?

1. It's Instantly Ready (The "Get" Operation)
In the old methods, if you wanted to see your album, you often had to do extra work (post-processing) to shuffle the photos around or fix the math.

  • WRSWR-SKIP: The album is always perfectly organized. You can grab it and show it to anyone at any second. It's like having a photo album that magically rearranges itself the moment a new photo is taken.

2. It's Blazing Fast (The "Add" Operation)
The authors tested this against the best existing methods.

  • The Result: When the parade is huge, their method is significantly faster. While other methods get slower as the crowd gets bigger, WRSWR-SKIP stays fast because it keeps skipping the boring parts.
  • The Analogy: Imagine reading a 1,000-page book. The old methods read every single word. WRSWR-SKIP reads the first page, realizes the next 900 pages are just filler, skips them instantly, and only reads the important chapters.

The Real-World Test

The authors didn't just talk about theory; they tested it on real data (34 million Wikipedia clicks).

  • Scenario: Imagine tracking which Wikipedia articles are being clicked. Some articles (like "Cat") get millions of clicks (high weight), while obscure ones get one click.
  • Outcome: Their new method processed these clicks faster than the competition and could instantly give a summary of the most popular articles without any lag.

Summary

This paper introduces a smart, skipping algorithm that lets computers take a weighted snapshot of a never-ending stream of data. It's like having a camera that can instantly zoom past millions of unimportant people to capture the celebrities, while ensuring the final photo album is mathematically perfect and ready to use immediately. It solves a problem that has been stuck in "slow motion" for years, making it perfect for real-time data analysis.