Unifying On- and Off-Policy Variance Reduction Methods

This paper unifies online A/B testing and off-policy evaluation by proving that the standard variance reduction methods in the two fields are mathematically equivalent: the simple Difference-in-Means estimator coincides with Inverse Propensity Scoring equipped with an optimally tuned control variate, and regression adjustment techniques align with Doubly Robust estimation.

Olivier Jeunen

Published Tue, 10 Ma

Imagine you are a chef trying to decide if a new secret spice makes your soup taste better. You have two ways to test this:

  1. The "Live" Test (Online): You cook two batches of soup right now. One batch gets the spice, the other doesn't. You serve them to customers and ask, "Which one did you like?"
  2. The "Archive" Test (Offline): You can't cook new soup right now (maybe the kitchen is closed). Instead, you look at your old logbooks from last week. You see what customers ordered and how they rated it. You try to guess, "If we had added the spice to those specific orders, would the ratings have been higher?"

For years, the people who do the Live Tests and the people who do the Archive Tests have been speaking different languages, using different tools, and even building their own separate kitchens. They both want to answer the same question: "Did the change actually help?" but they think they are solving two completely different problems.

This paper is like a translator who walks into both kitchens and says: "Stop! You are actually using the exact same recipe, just with different ingredients."

Here is the breakdown of the paper's two big discoveries, explained with simple analogies:

1. The "Difference-in-Means" vs. The "Magic Scale"

The Old View:

  • Online Chefs simply take the average rating of the "Spice" group and subtract the average rating of the "No Spice" group. This is called Difference-in-Means.
  • Offline Archivists use a complex math trick called Inverse Propensity Scoring (IPS). Imagine they have to weigh every old log entry differently because some people were more likely to order soup than others. It feels heavy and complicated.

The Paper's Revelation:
The author proves that if you take the Offline Archivist's complex "Magic Scale" (IPS) and give them a specific, perfectly tuned "counter-weight" (called a control variate), their calculation becomes mathematically identical to the Online Chef's simple average.

The Analogy:
Think of the Online Chef as someone weighing two apples on a scale.
Think of the Offline Archivist as someone trying to guess the weight of those apples by looking at a blurry photo of them in a bag.
The paper says: "If you give the Archivist a specific, pre-calculated weight to subtract from their guess, their blurry-photo math turns into the exact same number as the Chef's scale."
Takeaway: The "Live" and "Archive" methods aren't different; they are just different ways of writing the same equation.
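The identity is easy to check numerically. Below is a minimal sketch with simulated data and illustrative variable names (not the paper's notation): in a Bernoulli A/B test with treatment probability p, the IPS estimate of the treated arm's value, once corrected with the empirically optimal control-variate coefficient on the importance weights, reproduces the treated arm's simple sample mean exactly.

```python
import random

# Simulate a randomized A/B test: each unit is treated with probability p.
random.seed(0)
n, p = 1000, 0.3
a = [1 if random.random() < p else 0 for _ in range(n)]   # logged actions
r = [random.gauss(1.0 + 0.5 * ai, 1.0) for ai in a]       # logged rewards

w = [ai / p for ai in a]                                  # IPS importance weights
wr_bar = sum(wi * ri for wi, ri in zip(w, r)) / n         # plain IPS estimate
w_bar = sum(w) / n                                        # mean weight (E[w] = 1)

# Empirically optimal control-variate coefficient: Cov(w*r, w) / Var(w).
cov = sum(wi * wi * ri for wi, ri in zip(w, r)) / n - wr_bar * w_bar
var = sum(wi * wi for wi in w) / n - w_bar ** 2
beta = cov / var

ips_cv = wr_bar - beta * (w_bar - 1)                      # IPS + control variate
dim = sum(ri for ai, ri in zip(a, r) if ai) / sum(a)      # treated-arm sample mean

assert abs(ips_cv - dim) < 1e-9  # identical, up to floating point
```

The "counter-weight" here is the term beta * (w_bar - 1): the mean importance weight has a known expectation of 1, so subtracting its deviation costs nothing in bias, and the optimal coefficient turns the weighted estimate into the plain average.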

2. The "Regression Adjustment" vs. The "Double-Proof"

The Old View:

  • Online Chefs have gotten smarter. They know that some customers are just "super-hungry" and give high ratings regardless of the spice. So, they use a tool called CUPED or ML-RATE. They build a model to predict how hungry a customer is, and they subtract that "hunger factor" from the rating to get a cleaner result.
  • Offline Archivists use a tool called Doubly Robust (DR) estimation. This is a fancy method that combines the "Magic Scale" with a prediction model to make sure the answer is right even if one part of the math is slightly wrong.

The Paper's Revelation:
The author shows that the Online Chef's "Hunger Factor" tool (Regression Adjustment) is exactly the Offline Archivist's "Double-Proof" tool (Doubly Robust estimation), provided the prediction model depends only on the context (the customer) and not on which specific action was taken.

The Analogy:
Imagine you are trying to predict a student's test score.

  • Method A (Online): You look at the student's past grades and subtract the "expected" score to see how much the new study method helped.
  • Method B (Offline): You use a super-complex formula that combines the past grades with a weighted average of all possible study methods.

The paper says: "If you simplify the complex formula so it only cares about the student's background and not the specific study method, it collapses into the exact same math as Method A."
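The collapse can be seen numerically with a small sketch (simulated data, self-normalized per-arm weights as my choice for the example, not necessarily the paper's exact construction): when the Doubly Robust estimator's outcome model depends only on the context x and ignores the action, the model terms cancel in the treatment-effect difference, leaving exactly the regression-adjusted difference-in-means.

```python
import random

# Simulate a randomized experiment where the reward depends on context x
# (the "hunger factor") plus a small treatment effect.
random.seed(1)
n, p = 1000, 0.4
x = [random.gauss(0.0, 1.0) for _ in range(n)]             # context
a = [1 if random.random() < p else 0 for _ in range(n)]    # treatment flag
r = [2.0 * xi + 0.5 * ai + random.gauss(0.0, 1.0)
     for xi, ai in zip(x, a)]                              # observed reward

f = [2.0 * xi for xi in x]           # prediction model: context only, no action
res = [ri - fi for ri, fi in zip(r, f)]                    # residuals

def dr_value(arm):
    """Doubly Robust value of one arm, with self-normalized IPS weights."""
    prob = p if arm == 1 else 1 - p
    w = [(1.0 if ai == arm else 0.0) / prob for ai in a]
    model_term = sum(f) / n                  # model's guess, same for both arms
    correction = sum(wi * ri for wi, ri in zip(w, res)) / sum(w)
    return model_term + correction

def arm_mean(vals, arm):
    return sum(v for v, ai in zip(vals, a) if ai == arm) / a.count(arm)

dr_effect = dr_value(1) - dr_value(0)
reg_adjusted = ((arm_mean(r, 1) - arm_mean(f, 1))
                - (arm_mean(r, 0) - arm_mean(f, 0)))

assert abs(dr_effect - reg_adjusted) < 1e-9
```

Because f(x) never looks at the action, the model_term is identical for both arms and cancels; what survives is the per-arm average of the residuals r - f(x), which is precisely what regression adjustment computes.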

Why Does This Matter? (The "Aha!" Moment)

1. Breaking Down the Walls:
For a long time, researchers in "Online A/B Testing" and "Offline Evaluation" didn't talk to each other. They thought they were in different fields. This paper proves they are neighbors who have been building separate fences around the same house. Now, they can share tools.

2. Fixing a Hidden Bug (The Degrees of Freedom):
The paper found a tiny, subtle mistake in how people calculate "confidence" (how sure they are in their results).

  • The Issue: When you estimate a number from data, you "lose" a little bit of certainty (a degree of freedom).
  • The Fix: The Online chefs have been doing this correctly for a long time. The Offline archivists, when using the new "Magic Scale" method, were accidentally forgetting to subtract that extra bit of uncertainty.
  • The Result: Once the Online chefs' rule is applied to the Offline archivists' calculation, the two variance estimates match perfectly. It's like discovering one team was measuring in inches and the other in feet: as soon as both use the same ruler, their numbers agree.
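The degrees-of-freedom point can be illustrated with the most familiar case (a toy sketch, not the paper's setting): estimating the mean from the same data spends one degree of freedom, so an unbiased variance estimate divides by n - 1 rather than n; forgetting the subtraction biases the reported uncertainty low.

```python
import random

# Toy illustration: a parameter estimated from the data (here, the mean)
# costs one degree of freedom in the variance estimate.
random.seed(2)
n = 20
y = [random.gauss(5.0, 2.0) for _ in range(n)]

mean = sum(y) / n
ss = sum((yi - mean) ** 2 for yi in y)   # sum of squared deviations

naive = ss / n          # forgets that the mean was itself estimated
corrected = ss / (n - 1)  # standard "online" convention: one dof spent

# The naive estimate is biased low; the gap matters most for small samples.
assert corrected > naive
```

With a second estimated parameter, such as a fitted control-variate coefficient, the same logic calls for dividing by n - 2, and that small correction is what makes the online and offline variance formulas line up.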

Summary

This paper is a unification. It tells us that Online Experiments (running live tests) and Offline Experiments (analyzing old logs) are not two different beasts. They are the same beast wearing two different masks.

  • The simple "Average Difference" used online is secretly a complex "Weighted Average" used offline.
  • The "Prediction Models" used to clean up online data are secretly the "Double-Proof" models used offline.

By realizing this, engineers and scientists can stop reinventing the wheel. They can take the best variance-reduction tricks from one world and instantly apply them to the other, making their experiments faster, cheaper, and more accurate.