Policy-Aware Design of Large-Scale Factorial Experiments

This paper proposes a centralized, two-stage design for large-scale factorial experiments that leverages low-rank tensor modeling and sequential halving to efficiently identify high-performing product policies under limited budgets, demonstrating superior performance over existing benchmarks in combinatorially large, noisy environments.

Xin Wen, Xi Chen, Will Wei Sun, Yichen Zhang

Published 2026-04-13

Imagine you are the CEO of a massive online store. You have a million different ways to design your checkout page. You could change the button color (Red, Blue, Green), the payment flow (One-click, Two-step), and the coupon placement (Top, Bottom, Pop-up).

If you multiply those options together (3 × 2 × 3), you get 18 different combinations from just three features. Add a few dozen more features, each with a handful of options, and the count explodes multiplicatively into millions, then trillions of combinations.
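Counting the combinations is just a product over each feature's option count. A quick sketch, using the checkout-page features from the example above:

```python
from math import prod

# Options per feature, from the checkout-page example above
factors = {
    "button_color": 3,      # Red, Blue, Green
    "payment_flow": 2,      # One-click, Two-step
    "coupon_placement": 3,  # Top, Bottom, Pop-up
}

print(prod(factors.values()))  # 3 * 2 * 3 = 18 combinations

# The count grows multiplicatively: twenty features with
# five options each already gives 5**20, about 95 trillion.
print(5 ** 20)
```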

The Problem:
You only have a limited amount of "traffic" (real people visiting your site) to test these ideas. If you try to test every single combination one by one (like a traditional A/B test), you would run out of customers before you even finished testing the first 1%. It's like trying to find a specific needle in a haystack by pulling out one straw at a time. You'll never finish.

Furthermore, if you test the "Button Color" team and the "Payment Flow" team separately, they might miss the magic. Maybe a Red Button works great with a One-Click Flow but terribly with a Two-Step Flow. If you test the features in isolation, you miss that secret recipe.

The Solution: "Centralize and Then Randomize"
The authors propose a smart, two-step strategy to find the best design without testing everything. Think of it as a Talent Show combined with a Scout.

Step 1: The "Scout" Phase (Tensor Completion)

Instead of testing every single contestant, you use a "Scout" to look at a small, random sample of the crowd and guess the potential of everyone else.

  • The Analogy: Imagine a music competition with 1,000 singers. You can't listen to all of them. Instead, you listen to 50 random singers.
  • The Magic: You notice a pattern. The singers who are good at "High Notes" also tend to be good at "Stage Presence." You realize the talent isn't random; it's built on a few underlying "themes" (like the Low-Rank Tensor concept in the paper).
  • The Action: Based on these patterns, you can mathematically predict that "Singer #405" (who you haven't heard yet) is probably terrible because they lack those themes. You eliminate the bottom 50% of singers immediately without ever hearing them sing. You do this again and again, cutting the crowd in half each time, until you have a small group of "Top Contenders."

Why this works: You aren't guessing blindly. You are exploiting the fact that the world is structured. If you know how a Red button performs with the One-Click flow, and how a Blue button performs with both flows, you can infer how Red performs with the Two-Step flow without ever testing it.
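Here is a toy version of that inference. This is a rank-1 matrix sketch with made-up quality scores, not the paper's model (which uses a higher-order low-rank tensor), but it shows how structure lets you predict a combination you never tested:

```python
# Hypothetical rank-1 reward structure (an illustrative assumption):
# each factor level has a latent quality score, and the reward of a
# combination is the product of its levels' scores.
quality_color = {"Red": 1.0, "Blue": 2.0, "Green": 3.0}
quality_flow = {"One-click": 2.0, "Two-step": 0.5}

def reward(color, flow):
    return quality_color[color] * quality_flow[flow]

# Suppose we have MEASURED only these three combinations...
measured = {
    ("Red", "One-click"): reward("Red", "One-click"),      # 2.0
    ("Red", "Two-step"): reward("Red", "Two-step"),        # 0.5
    ("Green", "One-click"): reward("Green", "One-click"),  # 6.0
}

# ...then rank-1 structure pins down the UNMEASURED combination:
# M[Green, Two-step] = M[Green, One-click] * M[Red, Two-step] / M[Red, One-click]
predicted = (measured[("Green", "One-click")]
             * measured[("Red", "Two-step")]
             / measured[("Red", "One-click")])
print(predicted)                    # 1.5
print(reward("Green", "Two-step"))  # 1.5 -- matches, without testing it
```

The same cross-ratio trick, generalized to tensors and estimated from noisy samples, is what lets the "Scout" phase rank combinations it has never observed.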

Step 2: The "Talent Show" Phase (Sequential Halving)

Now you have a small group of the best candidates (maybe just 10 singers left). You have a limited amount of time left in the day.

  • The Analogy: You put these 10 finalists on stage. You give them all an equal amount of time to perform.
  • The Action: After the first round, you see who got the best applause. You cut the bottom 5. The remaining 5 get more time to perform. You cut the bottom 2. The final 3 get even more time.
  • The Result: You keep funneling your remaining time (budget) toward the winners until only one champion remains.
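The elimination schedule above can be sketched as a generic sequential-halving routine. This is an illustrative version with made-up noisy rewards, not the paper's exact procedure:

```python
import math
import random

def sequential_halving(candidates, pull, total_budget):
    """Split the budget evenly across rounds, test the surviving
    candidates equally, and drop the worse half each round."""
    survivors = list(candidates)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    per_round = total_budget // rounds
    while len(survivors) > 1:
        pulls_each = max(1, per_round // len(survivors))
        # Average observed reward for each survivor in this round.
        scores = {c: sum(pull(c) for _ in range(pulls_each)) / pulls_each
                  for c in survivors}
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[:max(1, len(survivors) // 2)]
    return survivors[0]

# Toy demo: 10 "finalists" whose true reward is index / 10, observed
# through Gaussian noise (a stand-in for noisy customer traffic).
rng = random.Random(0)
best = sequential_halving(
    range(10),
    pull=lambda c: c / 10 + rng.gauss(0, 0.05),
    total_budget=2000,
)
print(best)  # candidate 9, the true best
```

Note the design choice: survivors get progressively more pulls each round, so the budget concentrates on distinguishing the strongest contenders rather than on measuring everyone precisely.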

Why is this better than the old way?

  1. It's Fast: The old way tries to measure everything perfectly. This way tries to find the winner quickly.
  2. It's Smart: It understands that things are connected. A red button isn't just a color; it's part of a "vibe." The algorithm learns the "vibe" (the low-rank structure) and uses it to predict winners.
  3. It Saves Money: In the real world, testing costs money (traffic). This method finds the best product design using a fraction of the traffic required by traditional methods.

The Real-World Test

The authors tested this on Taobao (a massive Chinese e-commerce site). They tried to figure out the best "product bundles" (e.g., Pasta + Sauce + Cheese).

  • There were 1,680 possible bundles.
  • Traditional methods failed when they didn't have enough traffic to test them all.
  • This new "Scout then Talent Show" method found the best bundles even with very little traffic, especially when the data was "noisy" (uncertain).

The Takeaway

In a world where we have too many ideas and too little time, we can't test everything. Instead of testing every single combination, we should:

  1. Look for patterns to eliminate the bad ideas early (The Scout).
  2. Focus our remaining resources on the few ideas that look promising (The Talent Show).

This allows companies to innovate faster, spend less money on testing, and launch better products.
