Stability and Robustness via Regularization: Bandit Inference via Regularized Stochastic Mirror Descent

This paper establishes a general stability criterion for stochastic mirror descent algorithms to enable valid statistical inference in adaptive bandit settings, introducing regularized-EXP3 variants that simultaneously achieve minimax-optimal regret, nominal confidence interval coverage, and robustness to adversarial corruptions.

Budhaditya Halder, Ishan Sengupta, Koustav Chowdhury, Koulik Khamaru

Published Thu, 12 Ma

Imagine you are a chef running a food truck with a menu of 10 different dishes. Every day, you have to decide which dish to feature as the "Special of the Day" to attract customers. You don't know which dish is actually the best; you have to learn by serving them and seeing what people order.

This is the Multi-Armed Bandit problem. You are balancing two goals:

  1. Making Money (Regret Minimization): You want to serve the best dish as often as possible so you make the most profit.
  2. Knowing the Truth (Statistical Inference): You want to be able to say, with 95% certainty, "Dish #3 is definitely better than Dish #4," so you can write a review or plan your next menu.

The Problem: The "Adaptive" Trap

In the old days, scientists would say, "Just serve Dish #1 for 100 days, then Dish #2 for 100 days, and compare the results." This works because the data is independent (like flipping a coin).

But in the real world, you are smart. If you notice Dish #3 is selling like crazy, you start serving it more often. If Dish #4 is failing, you stop serving it. This is Adaptive Sampling.

Here is the catch: Your smartness breaks the math.
Because you changed your strategy based on what you saw, the data is no longer independent. If you try to use standard statistics to calculate a "confidence interval" (a range where the true quality of the dish likely lies), your numbers will be wrong. You might think Dish #3 is amazing when it's actually just lucky, or you might miss a hidden gem.

The Solution: Adding a "Safety Brake" (Regularization)

The authors of this paper propose a new way to run your food truck. They take a famous algorithm called EXP3 (which is great at making money) and add a special ingredient called Regularization.

Think of Regularization as a Safety Brake or a Stabilizer.

  • Without the brake: The algorithm is like a race car driver who swerves wildly. If Dish #3 gets one good review, the driver swerves hard to serve it 90% of the time. This makes money fast, but the data is messy and you can't trust your conclusions.
  • With the brake: The algorithm is like a cautious driver. Even if Dish #3 gets a great review, the "Regularizer" forces the driver to keep serving the other dishes a little bit. It prevents the driver from going too crazy in one direction too quickly.

This "braking" creates Stability. It ensures that the algorithm doesn't swing wildly. Because the algorithm is stable, the data collected remains "clean" enough for standard statistics to work again.
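The stabilized update can be sketched in a few lines. This is a hedged illustration, not the paper's exact algorithm: it uses the classic uniform-mixing form of EXP3, where a mixing weight `gamma` plays the role of the "brake" by guaranteeing every arm a probability floor of `gamma / n_arms`. The names `exp3_with_brake` and `reward_fn` are ours.

```python
import math
import random

def exp3_with_brake(n_arms, reward_fn, T, eta=0.1, gamma=0.1, seed=0):
    """EXP3 with a uniform-mixing 'safety brake': every arm keeps at
    least gamma / n_arms probability, so the sampling distribution can
    never swing all the way to one dish."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    pulls = [0] * n_arms
    probs = [1.0 / n_arms] * n_arms
    for _ in range(T):
        total = sum(weights)
        # Mix the exponential weights with the uniform distribution.
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        r = reward_fn(arm)                      # observed reward in [0, 1]
        # Importance-weighted update: only the pulled arm's weight moves.
        weights[arm] *= math.exp(eta * r / probs[arm])
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize for stability
        pulls[arm] += 1
    return probs, pulls
```

Because of the `gamma / n_arms` floor, even a run of lucky reviews for one dish cannot push the others below a fixed exploration rate, which is exactly the kind of stability the paper exploits.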

The Three Big Wins

1. Trustworthy Conclusions (Inference)
Because the algorithm is stable, you can finally build Confidence Intervals.

  • Analogy: Before, you were guessing the weight of a watermelon by feeling it while it was spinning on a table. Now, with the stabilizer, the watermelon is sitting still on a scale. You can say, "I am 95% sure this watermelon weighs between 10 and 12 pounds." The paper proves that their method gives you these trustworthy ranges, even while you are learning on the fly.

2. You Don't Lose Money (Regret)
You might worry: "If I force the driver to be cautious, won't I lose money?"
The authors prove that the answer is No. The "brake" is tuned so perfectly that you still make almost as much money as the most aggressive, reckless driver (in technical terms, the regret remains minimax-optimal). You get the best of both worlds: you learn fast and you get reliable data.

3. The "Sabotage" Proof (Robustness)
This is the coolest part. Imagine a rival food truck owner who tries to sabotage you. They leave fake 5-star reviews for your worst dish or 1-star reviews for your best dish (this is called Adversarial Corruption).

  • Old Algorithms (like UCB): These are like glass houses. If the rival sabotages you a little bit, the algorithm panics, stops serving the good dish, and you keep losing money for the rest of the run (this is called Linear Regret: losses that grow in proportion to time).
  • This New Algorithm: It's like a bunker. Because it has that "Safety Brake" (Regularization), it ignores the small lies. Even if the rival tries to trick the system, the algorithm keeps serving the right dishes and keeps its statistical conclusions valid. It can handle a surprising amount of sabotage without breaking a sweat.
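To make the "bunker" intuition concrete, here is a toy two-arm simulation. It is our own illustration, not the paper's experiment, and a simple greedy rule stands in for the fragile baseline rather than full UCB. A single corrupted first sample locks plain greedy onto the bad arm forever, while the same rule with a small probability floor (the "brake") recovers.

```python
import random

def pulls_under_corruption(T, floor, seed=0):
    """Arm 0 truly pays 0.9, arm 1 pays 0.1, but the adversary corrupts
    the very first sample of arm 0 down to 0.0.  `floor` is the minimum
    per-arm sampling probability; floor=0.0 is plain greedy."""
    rng = random.Random(seed)
    true_means = [0.9, 0.1]
    totals, counts, pulls = [0.0, 0.0], [0, 0], [0, 0]
    for _ in range(T):
        if rng.random() < 2 * floor:
            arm = rng.randrange(2)            # forced exploration (the brake)
        elif counts[0] == 0 or counts[1] == 0:
            arm = 0 if counts[0] == 0 else 1  # try each arm once first
        else:
            arm = 0 if totals[0] / counts[0] > totals[1] / counts[1] else 1
        r = true_means[arm]
        if arm == 0 and counts[0] == 0:
            r = 0.0                           # the adversary's one lie
        totals[arm] += r
        counts[arm] += 1
        pulls[arm] += 1
    return pulls
```

Plain greedy (`floor=0.0`) pulls the good arm exactly once, sees the lie, and abandons it for the rest of the run; with even a 5% floor, one forced revisit reveals the truth and the good arm dominates thereafter.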

Summary

This paper introduces a new way to make decisions in uncertain environments (like your food truck, or a medical trial, or a recommendation engine).

They took a powerful learning tool, added a "stabilizer" (regularization), and proved that this simple tweak solves three huge problems at once:

  1. It lets you learn fast (low regret).
  2. It lets you trust your data (valid inference).
  3. It lets you ignore liars (robustness to corruption).

It's a reminder that sometimes, being a little less "aggressive" and adding a little bit of "caution" actually makes you smarter, more reliable, and tougher.