Imagine you are a chef running a massive food festival. You have 50 different new recipes (treatments) you want to test, but you only have 2,000 hungry customers (samples) to feed them.
In the old days, scientists would use a "Uniform Design." This is like giving every customer a tiny taste of every single recipe, one by one, in a strict rotation. You'd end up with 40 bites of each recipe. It's fair, but it's inefficient. If Recipe #12 is terrible and Recipe #7 is amazing, you still wasted 40 bites on the bad one and only got 40 bites on the good one.
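If you like code, the uniform design is just a strict rotation. Here's a tiny sketch (the function name and numbers are from the festival example above, not from the paper):

```python
# Uniform design: cycle through all treatments in a fixed rotation,
# so every treatment ends up with the same number of samples.
def uniform_allocation(n_treatments, n_samples):
    """Return how many samples each treatment receives under round-robin."""
    counts = [0] * n_treatments
    for i in range(n_samples):
        counts[i % n_treatments] += 1  # strict rotation: 0, 1, ..., k-1, 0, 1, ...
    return counts

counts = uniform_allocation(50, 2000)  # 50 recipes, 2,000 customers
# → every recipe gets exactly 40 tastes, good or bad
```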
This paper is about a smarter way to run the festival. It's called a "Demonstration Experiment."
The Goal: "Show Me It Works!"
Usually, scientists want to know exactly how much better Recipe #7 is than Recipe #1 (estimating the effect). But in the early stages, you don't need a precise number. You just need to prove: "Hey! At least one of these recipes is actually good!"
The goal isn't to find the perfect recipe immediately; it's to find any recipe that beats a "bad taste" threshold so you can justify spending more money on a bigger study later.
The Problem: The "Strategic" Chef
The authors ask: What if the chef gets to decide who gets what based on what they've tasted so far?
- "Oh, Recipe #7 tastes great! Let's give it to the next 100 people!"
- "Recipe #12 tastes like mud. Let's stop feeding it to anyone."
This is called Adaptive Sampling. The problem is that if you change the rules while the game is being played, your old math tools break. If you just look at the data at the end, you might trick yourself into thinking a bad recipe is good, simply because you stopped testing it while it happened to be on a lucky streak.
The Solution: Two New "Magic Rulers"
The authors invented two special ways to measure the results that cannot be tricked, even if the chef is playing favorites.
1. The "Group Hug" Ruler (Pooled Statistic)
Imagine you take all the good-tasting bites from every recipe and mix them into one giant smoothie.
- How it works: It looks at the total evidence across all recipes. If any recipe is truly good, it will pull the average of the whole group up.
- The Analogy: It's like a team sport. Even if one player is a superstar, the team score goes up. This method is great when you think many recipes might be slightly good. It's robust and hard to fool.
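To make the "smoothie" idea concrete, here is an illustrative sketch of a pooled statistic. This is not the paper's exact formula (the function name, the known-noise assumption, and the normalization are all simplifications for illustration); it just shows the flavor: standardize every taste against the "bad" threshold and average across the whole pool, so any truly good recipe pulls the total up.

```python
import math

# Illustrative pooled statistic (NOT the paper's exact formula): average the
# standardized deviation of every observation from the "bad taste" threshold.
# If any recipe is truly above the threshold, it drags this pooled sum up.
def pooled_statistic(samples_by_recipe, threshold, noise_sd=1.0):
    all_devs = [(x - threshold) / noise_sd
                for samples in samples_by_recipe
                for x in samples]
    n = len(all_devs)
    # Under the null (every recipe at or below threshold), this is roughly
    # standard normal, regardless of how adaptively samples were allocated.
    return sum(all_devs) / math.sqrt(n)
```

Notice that the pooled value never asks which recipe a bite came from, which is part of why a strategic chef can't fool it.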
2. The "Star Player" Ruler (Max Statistic)
Imagine you ignore the team score and just look at the single best player on the field.
- How it works: It tracks the "t-statistic" (how far a recipe's average sits above the threshold, measured relative to the noise in its tastings) for each recipe individually. It asks: "Is there one specific recipe that is clearly beating the bad threshold?"
- The Analogy: This is like a "Best Player" award. It's very strict. It allows you to stop the experiment early if you find a winner. However, because it's looking for a needle in a haystack, it's a bit more conservative (it doesn't want to give out the award by mistake).
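Here is the same idea in code, as a rough sketch rather than the paper's actual statistic: compute a one-sample t-statistic for each recipe against the threshold, then report only the single best one (the function name and the skip rules for tiny samples are my own illustrative choices).

```python
import math
import statistics

# Illustrative "max" statistic (a sketch, not the paper's exact construction):
# a one-sample t-statistic per recipe against the threshold, then the max.
def max_statistic(samples_by_recipe, threshold):
    t_stats = []
    for samples in samples_by_recipe:
        if len(samples) < 2:
            continue  # need at least 2 samples to estimate a standard deviation
        sd = statistics.stdev(samples)
        if sd == 0:
            continue  # no spread: t-statistic undefined for this recipe
        mean = statistics.mean(samples)
        t_stats.append((mean - threshold) / (sd / math.sqrt(len(samples))))
    return max(t_stats)
```

A large max value says "this one specific recipe clearly beats the threshold," which is exactly the early-stopping evidence the authors want.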
The Secret Weapon: The "Signal-to-Noise" GPS
The paper also introduces a new way to decide which recipe to feed next. They call it SN-UCB.
Most chefs just look at the Average Taste (Mean).
- Bad Chef: "Recipe #5 tastes 8/10! It's our winner!" (But it was only tasted once. Maybe it was just luck.)
- Bad Chef: "Recipe #9 only tastes 7/10. Forget it." (Even though 1,000 tastings show it's reliably good.)
The SN-UCB chef looks at the Signal-to-Noise Ratio.
- Signal: How good does it taste?
- Noise: How much does the taste vary?
The Analogy: Imagine two runners.
- Runner A runs 100 meters in 10 seconds, but sometimes runs 15 seconds and sometimes 5. (High noise).
- Runner B runs 100 meters in 11 seconds, but always runs 11 seconds. (Low noise).
If you only look at the average, Runner A looks faster. But if you look at the Signal-to-Noise, Runner B is actually the more reliable bet for a race. The SN-UCB algorithm focuses on the runners who are consistently good, not just the ones who got lucky once. This helps the experiment find the truth much faster.
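To see how an SN-UCB-style rule differs from a mean-chasing one, here is a rough sketch. The exact index in the paper may differ; the function name, the exploration bonus, and the tie-breaking are illustrative assumptions. The key point is that recipes are ranked by signal-to-noise, not by raw average.

```python
import math
import statistics

# Sketch of a signal-to-noise UCB rule (the paper's exact index may differ):
# score each recipe by its estimated signal-to-noise ratio plus an
# exploration bonus that shrinks as the recipe collects more samples.
def sn_ucb_pick(samples_by_recipe, total_pulls, bonus_scale=1.0):
    best_arm, best_score = None, -math.inf
    for arm, samples in enumerate(samples_by_recipe):
        if len(samples) < 2:
            return arm  # taste every recipe a couple of times first
        mean = statistics.mean(samples)
        sd = statistics.stdev(samples) or 1e-9  # guard against zero noise
        snr = mean / sd                         # signal-to-noise ratio
        bonus = bonus_scale * math.sqrt(math.log(total_pulls) / len(samples))
        if snr + bonus > best_score:
            best_arm, best_score = arm, snr + bonus
    return best_arm
```

With scores like Runner A's (average 10, wildly variable) versus Runner B's (average 11, rock steady), a mean-only rule and an SNR rule can disagree; the SNR rule backs the consistent performer.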
Why This Matters
In the real world, we do this with:
- Online Shopping: "Should we show this new ad to everyone, or just people who clicked yesterday?"
- Medicine: "Should we keep testing this drug on patients who aren't responding, or switch to the ones who are?"
The Takeaway:
This paper gives us a new rulebook for running experiments where we can change the rules as we go. It proves that even if we are "strategic" (giving more chances to the winners), we can still use math to prove, with high confidence, that we found a real winner. It turns the chaotic process of "trying things out" into a rigorous scientific demonstration.