Imagine you are a detective trying to solve a mystery: Did a new feature on a website actually make people buy more things?
To find out, you run a test (an A/B test). You show the new feature to half your visitors (the "Treatment" group) and the old version to the other half (the "Control" group). Then, you compare the sales.
The problem? Human behavior is messy. Some people are just naturally big spenders; others are bargain hunters. Some visit at 2 AM when they are tired; others visit at noon when they are energetic. This "noise" makes it hard to see if the new feature actually worked or if the results were just luck.
To get a clearer picture, you need to reduce the noise (variance). If you can't get more people to join the test (which costs money), you have to make the data you already have sharper.
The Old Way: Looking in the Rearview Mirror
For years, companies used a clever trick called CUPED (Controlled-experiment Using Pre-Experiment Data). Think of it as looking in the rearview mirror.
Before the test even starts, you look at a user's history: How much did they spend last month? How many items did they view last week? You use this past data to predict how they should have performed. Then, you adjust the test results based on that prediction.
- The Good: It helps smooth out the noise.
- The Bad: The rearview mirror only shows you where you were, not where you are going. If a user had a quiet month last year but is suddenly excited today, the rearview mirror misses it. The prediction isn't perfect.
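The rearview-mirror idea can be sketched in a few lines. This is a toy simulation, not the paper's code: the variable names (`past_spend`, `spend`) and the simulated numbers are illustrative assumptions. The core CUPED move is to subtract from each user's outcome the part that a pre-experiment covariate already explains:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Toy data: last month's spend (the "rearview mirror") predicts this month's.
past_spend = rng.gamma(shape=2.0, scale=25.0, size=n)
treatment = rng.integers(0, 2, size=n)          # 0 = control, 1 = treatment
true_effect = 5.0
spend = 0.8 * past_spend + true_effect * treatment + rng.normal(0, 20, size=n)

# CUPED: remove the part of the outcome explained by the pre-period covariate.
theta = np.cov(spend, past_spend)[0, 1] / np.var(past_spend)
spend_adj = spend - theta * (past_spend - past_spend.mean())

raw_var = spend[treatment == 1].var() + spend[treatment == 0].var()
adj_var = spend_adj[treatment == 1].var() + spend_adj[treatment == 0].var()
print(f"noise in the comparison shrinks by ~{1 - adj_var / raw_var:.0%}")
```

Because `past_spend` was recorded before the experiment, the treatment cannot have changed it, so subtracting its contribution shrinks the noise without touching the treatment effect itself.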
The Trap: "Mediator" Variables
You might think, "Why not just look at what they did during the test? Like, how many items they added to their cart right now?"
That seems logical, but it's a trap!
Imagine the new feature is a bright red "Buy Now" button.
- The button makes people click more (Treatment).
- Clicking more leads to more items in the cart (The "In-Experiment" data).
- More items in the cart leads to more sales (The Outcome).
If you try to "adjust" for the items in the cart, you accidentally erase the effect of the button. You are saying, "Well, they bought more because they put more in the cart," while forgetting that the button caused them to put more in the cart in the first place. This distortion is called post-treatment bias. It's like trying to measure how much a fertilizer helped a plant grow while adjusting for the fact that the plant is now taller: you'd conclude the fertilizer did nothing!
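You can watch the trap spring in a small simulation. This is an illustrative sketch with made-up numbers, not the paper's setup: here the treatment works *only* through the mediator (button → cart items → sales), so the naive comparison finds the effect, while "controlling for" the mediator wipes it out:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
treatment = rng.integers(0, 2, size=n)

# The treatment acts entirely through the mediator: button -> cart -> sales.
cart_items = 2.0 + 1.5 * treatment + rng.normal(0, 1, size=n)   # post-treatment!
sales = 10.0 + 3.0 * cart_items + rng.normal(0, 2, size=n)

# Naive comparison recovers the true effect (1.5 extra items * 3.0 per item = 4.5).
naive = sales[treatment == 1].mean() - sales[treatment == 0].mean()

# "Adjusting" for the mediator: regress sales on treatment AND cart_items.
X = np.column_stack([np.ones(n), treatment, cart_items])
beta = np.linalg.lstsq(X, sales, rcond=None)[0]
print(f"unadjusted effect ~ {naive:.2f}, mediator-adjusted effect ~ {beta[1]:.2f}")
```

The adjusted coefficient on treatment collapses toward zero: the regression has "explained away" the very pathway the feature used to work.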
The New Solution: The "Side-Door" Strategy
This paper proposes a brilliant new framework that combines the Rearview Mirror (past data) with a specific type of Side-Door (current data).
The authors realized that not all "during the test" data is a trap. Some data is just noise that happens to be very predictive, but isn't caused by the treatment.
The Analogy: The Rainy Day Commute
Imagine you are testing a new traffic app (Treatment) to see if it gets people to work faster (Outcome).
- The Trap: You look at "Time spent at traffic lights." The app might change the route, which changes the time at lights. If you adjust for this, you hide the app's success.
- The Safe Data: You look at "The color of the sky" or "The number of birds flying overhead" during the test.
- Does the traffic app change the color of the sky? No.
- Does the traffic app change the number of birds? No.
- But, if it's raining (gray sky), everyone drives slower. If it's sunny, everyone drives faster.
The "Sky Color" is a post-treatment variable (you see it during the test), but it is treatment-insensitive (the app didn't change it). It is also highly predictive (rain slows everyone down).
How the Paper's Method Works
The authors built a two-step system to find these "Safe Sky Colors":
- Step 1: The Rearview Mirror (CUPAC). First, they use a machine-learning model to predict each user's sales from their pre-experiment history (the CUPAC approach: Control Using Predictions As Covariates). This gets rid of the "old" noise.
- Step 2: The Safe Side-Door (The New Trick). They look at what users are doing right now (like "time spent on the page" or "number of clicks").
- They run a quick statistical test: "Did the Treatment group and Control group have different average values for this metric?"
- If the answer is YES: It's a trap (the treatment changed it). Discard it.
- If the answer is NO: It's safe! The treatment didn't change it, but it's still very good at predicting the final result. Keep it.
They then add this "Safe Side-Door" data into the equation. Because the treatment didn't change it, adding it doesn't create bias. But because it's so predictive, it acts like a super-powerful noise-canceling headphone, making the final result much clearer.
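The two steps above can be sketched end to end. This is a simplified stand-in for the paper's procedure, with assumed names and numbers: `latency` plays the role of a "sky color" covariate (predictive of the outcome, untouched by the treatment), and the gate is an ordinary two-sample z-test for balance between the arms:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
treatment = rng.integers(0, 2, size=n)

# Hypothetical in-experiment covariate the treatment does NOT move
# (the "sky color"), but which strongly drives the outcome.
latency = rng.normal(0, 1, size=n)
true_effect = 2.0
sales = true_effect * treatment + 4.0 * latency + rng.normal(0, 1, size=n)

# Step 2's gate: did the two arms see different average values of the covariate?
a, b = latency[treatment == 1], latency[treatment == 0]
z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
is_safe = abs(z) < 1.96   # fail to reject "the treatment didn't change it"

if is_safe:
    # Safe side-door: fold it in, CUPED-style, to cancel noise.
    theta = np.cov(sales, latency)[0, 1] / np.var(latency)
    sales_adj = sales - theta * (latency - latency.mean())
else:
    sales_adj = sales  # possible mediator: discard it, keep the raw outcome

est = sales_adj[treatment == 1].mean() - sales_adj[treatment == 0].mean()
print(f"covariate passed the gate: {is_safe}, estimated effect ~ {est:.2f}")
```

Either branch leaves the estimate unbiased; the gate only decides whether you get the extra noise cancellation. (A real deployment would use a proper pre-registered test and multiple-comparison care when screening many covariates.)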
Why This Matters
- It's Safer: You don't have to guess which data is safe. The paper gives a mathematical rule to test it.
- It's Stronger: "During the test" data is often much more relevant than "past" data. By using the safe parts of it, you get a much sharper signal.
- It's Practical: It works with the tools companies already use. You don't need to rebuild your entire system; you just add this second step.
In a nutshell:
The paper teaches us how to use the "live" data from an experiment to make our results more precise, without accidentally deleting the very effect we are trying to measure. It's like wearing noise-canceling headphones that filter out the static of human behavior, letting you hear the true sound of your new feature's success.