Imagine you are the manager of a massive online store, like Walmart. Every day, thousands of new products arrive, and you have to decide which ones to show to your customers. You don't know which products are the "hits" yet, so you have to experiment. This is where Multi-Armed Bandit (MAB) algorithms come in. Think of them as smart robots that try different products (the "arms" of a slot machine) to see which one makes the most money.
But here's the problem: How do you know if your new, fancy robot (let's call it Robot B) is actually better than your old, reliable robot (Robot A)?
The Old Way: The "Double Trouble" Experiment
Traditionally, to compare two robots, you would run a standard "A/B test."
- You split your customers into two groups.
- Group 1 talks only to Robot A.
- Group 2 talks only to Robot B.
- You wait for them to finish their shopping and compare the total sales.
The Flaw: This is incredibly wasteful.
- Memory Loss: Robot A and Robot B are like two students studying for the same exam but in separate rooms. They can't share notes. Robot A learns from Group 1, and Robot B learns from Group 2. They build their own separate "memories."
- High Cost: To get a clear answer, you need twice as many customers (2T, where T is the number of customers one experiment needs) because you are running two separate experiments.
- Noisy Results: Because the robots are learning as they go, their performance is "noisy" (unpredictable). If you run this test once, the results might be a fluke. You have to repeat the whole expensive experiment many times to be sure, which delays launching the better robot.
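The old way can be sketched in a few lines. This is a minimal illustration, not the paper's setup: the hidden "hit rates," the epsilon-greedy stand-in for each robot, and all names here are invented for the example.

```python
import random

def pull(arm, probs, rng):
    """One real customer interaction: the product 'arm' sells with a hidden probability."""
    return 1 if rng.random() < probs[arm] else 0

def run_alone(probs, t, rng, eps=0.1):
    """One robot learns on its own group of t customers (a simple epsilon-greedy learner)."""
    counts = [0] * len(probs)
    values = [0.0] * len(probs)
    total = 0
    for _ in range(t):
        # explore at random with probability eps, otherwise exploit the best arm so far
        if rng.random() < eps:
            arm = rng.randrange(len(probs))
        else:
            arm = max(range(len(probs)), key=lambda a: values[a])
        r = pull(arm, probs, rng)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # running mean of rewards
        total += r
    return total

rng = random.Random(0)
probs = [0.3, 0.5, 0.7]                        # hidden "hit rates" of three products
sales_a = run_alone(probs, 100, rng)           # Group 1: Robot A, 100 real customers
sales_b = run_alone(probs, 100, rng, eps=0.3)  # Group 2: Robot B, 100 MORE real customers
print(sales_a, sales_b)
```

Note the cost: 200 real interactions are spent, the two robots never share a single observation, and one run of this comparison is too noisy to trust.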
The New Way: "Artificial Replay" (AR)
The authors of this paper propose a clever trick called Artificial Replay. Instead of running two separate experiments, they run one, and then "rewind" the tape for the second robot.
Here is how it works, using a Restaurant Analogy:
Imagine you have two chefs, Chef A (Control) and Chef B (Treatment), trying to figure out the best dish to serve.
Phase 1 (The Real Run): You let Chef A cook for 100 customers. You write down exactly what they ordered and how much they liked it.
- Customer 1 ordered Soup, liked it 8/10.
- Customer 2 ordered Salad, liked it 5/10.
- Customer 3 ordered Soup, liked it 9/10.
Phase 2 (The Replay): Now, you want to see how Chef B would have done with those same 100 customers.
- The Trick: You don't need 100 new customers. You just ask Chef B, "What would you have served to Customer 1?"
- If Chef B says, "I would have served Soup," you don't go to the kitchen to cook a new soup. You simply look at your notes from Phase 1 and say, "Okay, since Chef A served Soup to Customer 1 and they liked it 8/10, we'll record that Chef B also served Soup and got an 8/10."
- The "Real" Interaction: You only go to the kitchen (the real environment) if Chef B decides to serve something Chef A never served to that customer.
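The two phases above can be sketched in code. This is a minimal illustration of the per-customer replay loop just described, assuming a simple Bernoulli environment and an epsilon-greedy learner as stand-ins; none of the class or function names come from the paper.

```python
import random

class BernoulliEnv:
    """The 'kitchen': each arm (dish) scores 1 with some hidden probability."""
    def __init__(self, probs, rng):
        self.probs, self.rng = probs, rng
        self.pulls = 0                       # count of real (costly) interactions
    def pull(self, arm):
        self.pulls += 1
        return 1 if self.rng.random() < self.probs[arm] else 0

class EpsilonGreedy:
    """A simple stand-in bandit algorithm (not the paper's specific robots)."""
    def __init__(self, n_arms, eps, rng):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.eps, self.rng = eps, rng
    def select_arm(self):
        if self.rng.random() < self.eps:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.values[a])
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def run_live(algo, env, t):
    """Phase 1: Chef A serves t real customers; write everything down."""
    log = []
    for _ in range(t):
        arm = algo.select_arm()
        reward = env.pull(arm)
        algo.update(arm, reward)
        log.append((arm, reward))
    return log

def run_with_replay(algo, env, log):
    """Phase 2: replay the log for Chef B; only go to the kitchen on a mismatch."""
    fresh = 0
    for logged_arm, logged_reward in log:
        arm = algo.select_arm()
        if arm == logged_arm:
            reward = logged_reward           # reuse Chef A's recorded outcome
        else:
            reward = env.pull(arm)           # a genuinely new, costly interaction
            fresh += 1
        algo.update(arm, reward)
    return fresh

rng = random.Random(0)
env = BernoulliEnv([0.3, 0.5, 0.7], rng)
log = run_live(EpsilonGreedy(3, eps=0.2, rng=rng), env, t=100)
fresh = run_with_replay(EpsilonGreedy(3, eps=0.05, rng=rng), env, log)
print(f"logged customers: {len(log)}, fresh pulls for treatment: {fresh}")
```

The `fresh` counter is the whole point: the treatment algorithm only pays for the customers where it would have acted differently, rather than for a full second group of 100.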
Why is this a Game-Changer?
1. It's Like Sharing a Notebook (Variance Reduction)
In the old way, Chef A and Chef B are guessing in the dark. Their results bounce around wildly.
In the new way, because they are looking at the same customers and the same reactions, their results are linked.
- If the customers were just having a "bad day" and hated everything, both chefs would get low scores.
- If the customers were "happy" and loved everything, both chefs would get high scores.
- Because they move together, the "noise" cancels out. You can see the true difference between the chefs much faster and with fewer customers. It's like comparing two runners on the same track at the same time, rather than running them on different tracks on different days.
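The "same track" effect can be checked with a toy simulation. This sketch is illustrative only: each customer has a random "mood" that shifts both chefs' scores, and we compare the spread of the estimated difference when the two chefs share customers (paired) versus when they get separate groups (unpaired). The numbers are made up for the example.

```python
import random
import statistics

def experiment(paired, rng, n=50, quality_a=0.5, quality_b=0.7):
    """One comparison of two chefs over n customers; returns the estimated
    score difference (Chef B minus Chef A). Paired runs share each customer's mood."""
    diffs = []
    for _ in range(n):
        mood = rng.gauss(0, 1)                         # customer-level noise
        score_a = quality_a + mood + rng.gauss(0, 0.1)
        mood_b = mood if paired else rng.gauss(0, 1)   # unpaired: a different customer
        score_b = quality_b + mood_b + rng.gauss(0, 0.1)
        diffs.append(score_b - score_a)
    return statistics.mean(diffs)

rng = random.Random(1)
paired = [experiment(True, rng) for _ in range(500)]
unpaired = [experiment(False, rng) for _ in range(500)]
print("paired   std of estimate:", statistics.stdev(paired))
print("unpaired std of estimate:", statistics.stdev(unpaired))
```

In the paired version the customer's mood appears in both scores and cancels when you subtract, so the difference estimate is far less noisy; in the unpaired version the moods are independent and their noise adds up instead.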
2. It Saves Money (Sample Efficiency)
In the old way, you needed 200 customers to compare two robots (100 for A, 100 for B).
With Artificial Replay, you might only need 105 customers.
- You run the first robot on 100 customers.
- The second robot only needs to interact with the "real" environment for the few times it chooses a different path than the first robot.
- Result: You cut your experiment costs nearly in half.
3. It's Fair (Symmetry)
Does it matter if you run Robot A first or Robot B first? No. The math proves that the result is the same either way. It's a fair fight.
The Big Picture
This paper is about smarter experimentation.
- Before: "Let's hire two teams of 1,000 people to test our new software. It will take a month and cost a fortune, and the results might still be fuzzy."
- Now (Artificial Replay): "Let's hire one team of 1,000 people. Then, we'll simulate how the second team would have acted based on the first team's data. We'll get a clear answer in half the time and with half the cost."
This method allows online platforms (like Walmart, Amazon, or Netflix) to test new algorithms much faster, cheaper, and more accurately, meaning you get better recommendations and products sooner.