Stability and Robustness via Regularization: Bandit Inference via Regularized Stochastic Mirror Descent

This paper establishes a general stability criterion for stochastic mirror descent algorithms to enable valid statistical inference in adaptive bandit settings, introducing regularized-EXP3 variants that simultaneously achieve minimax-optimal regret, nominal confidence interval coverage, and robustness to adversarial corruptions.

Budhaditya Halder, Ishan Sengupta, Koustav Chowdhury, Koulik Khamaru

Published Thu, 12 Ma

Imagine you are a chef running a food truck with a menu of 10 different dishes. Every day, you have to decide which dish to feature as the "Special of the Day" to attract customers. You don't know which dish is actually the best; you have to learn by serving them and seeing what people order.

This is the Multi-Armed Bandit problem. You are balancing two goals:

  1. Making Money (Regret Minimization): You want to serve the best dish as often as possible so you make the most profit.
  2. Knowing the Truth (Statistical Inference): You want to be able to say, with 95% certainty, "Dish #3 is definitely better than Dish #4," so you can write a review or plan your next menu.

The Problem: The "Adaptive" Trap

In the old days, scientists would say, "Just serve Dish #1 for 100 days, then Dish #2 for 100 days, and compare the results." This works because the data is independent (like flipping a coin).

But in the real world, you are smart. If you notice Dish #3 is selling like crazy, you start serving it more often. If Dish #4 is failing, you stop serving it. This is Adaptive Sampling.

Here is the catch: Your smartness breaks the math.
Because you changed your strategy based on what you saw, the data is no longer independent. If you try to use standard statistics to calculate a "confidence interval" (a range where the true quality of the dish likely lies), your numbers will be wrong. You might think Dish #3 is amazing when it's actually just lucky, or you might miss a hidden gem.

The Solution: Adding a "Safety Brake" (Regularization)

The authors of this paper propose a new way to run your food truck. They take a famous algorithm called EXP3 (which is great at making money) and add a special ingredient called Regularization.

Think of Regularization as a Safety Brake or a Stabilizer.

  • Without the brake: The algorithm is like a race car driver who swerves wildly. If Dish #3 gets one good review, the driver swerves hard to serve it 90% of the time. This makes money fast, but the data is messy and you can't trust your conclusions.
  • With the brake: The algorithm is like a cautious driver. Even if Dish #3 gets a great review, the "Regularizer" forces the driver to keep serving the other dishes a little bit. It prevents the driver from going too crazy in one direction too quickly.

This "braking" creates Stability. It ensures that the algorithm doesn't swing wildly. Because the algorithm is stable, the data collected remains "clean" enough for standard statistics to work again.
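The stabilized update can be sketched in a few lines. This is a hedged illustration, not the paper's exact algorithm: it uses the classic uniform-mixing form of EXP3, where a mixing weight `gamma` plays the role of the "brake" by guaranteeing every arm a probability floor of `gamma / n_arms`. The names `exp3_with_brake` and `reward_fn` are ours.

```python
import math
import random

def exp3_with_brake(n_arms, reward_fn, T, eta=0.1, gamma=0.1, seed=0):
    """EXP3 with a uniform-mixing 'safety brake': every arm keeps at
    least gamma / n_arms probability, so the sampling distribution can
    never swing all the way to one dish."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    pulls = [0] * n_arms
    probs = [1.0 / n_arms] * n_arms
    for _ in range(T):
        total = sum(weights)
        # Mix the exponential weights with the uniform distribution.
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        r = reward_fn(arm)                      # observed reward in [0, 1]
        # Importance-weighted update: only the pulled arm's weight moves.
        weights[arm] *= math.exp(eta * r / probs[arm])
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize for stability
        pulls[arm] += 1
    return probs, pulls
```

Because of the `gamma / n_arms` floor, even a run of lucky reviews for one dish cannot push the others below a fixed exploration rate, which is exactly the kind of stability the paper exploits.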

The Three Big Wins

1. Trustworthy Conclusions (Inference)
Because the algorithm is stable, you can finally build Confidence Intervals.

  • Analogy: Before, you were guessing the weight of a watermelon by feeling it while it was spinning on a table. Now, with the stabilizer, the watermelon is sitting still on a scale. You can say, "I am 95% sure this watermelon weighs between 10 and 12 pounds." The paper proves that their method gives you these trustworthy ranges, even while you are learning on the fly.

2. You Don't Lose Money (Regret)
You might worry: "If I force the driver to be cautious, won't I lose money?"
The authors prove that the answer is No. The "brake" is tuned so perfectly that you still make almost as much money as the most aggressive, reckless driver (in technical terms, the regret remains minimax-optimal). You get the best of both worlds: you learn fast and you get reliable data.

3. The "Sabotage" Proof (Robustness)
This is the coolest part. Imagine a rival food truck owner who tries to sabotage you. They leave fake 5-star reviews for your worst dish or 1-star reviews for your best dish (this is called Adversarial Corruption).

  • Old Algorithms (like UCB): These are like glass houses. If the rival sabotages you a little bit, the algorithm panics, stops serving the good dish, and you keep losing money for the rest of the run (this is called Linear Regret: losses that grow in proportion to time).
  • This New Algorithm: It's like a bunker. Because it has that "Safety Brake" (Regularization), it ignores the small lies. Even if the rival tries to trick the system, the algorithm keeps serving the right dishes and keeps its statistical conclusions valid. It can handle a surprising amount of sabotage without breaking a sweat.
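To make the "bunker" intuition concrete, here is a toy two-arm simulation. It is our own illustration, not the paper's experiment, and a simple greedy rule stands in for the fragile baseline rather than full UCB. A single corrupted first sample locks plain greedy onto the bad arm forever, while the same rule with a small probability floor (the "brake") recovers.

```python
import random

def pulls_under_corruption(T, floor, seed=0):
    """Arm 0 truly pays 0.9, arm 1 pays 0.1, but the adversary corrupts
    the very first sample of arm 0 down to 0.0.  `floor` is the minimum
    per-arm sampling probability; floor=0.0 is plain greedy."""
    rng = random.Random(seed)
    true_means = [0.9, 0.1]
    totals, counts, pulls = [0.0, 0.0], [0, 0], [0, 0]
    for _ in range(T):
        if rng.random() < 2 * floor:
            arm = rng.randrange(2)            # forced exploration (the brake)
        elif counts[0] == 0 or counts[1] == 0:
            arm = 0 if counts[0] == 0 else 1  # try each arm once first
        else:
            arm = 0 if totals[0] / counts[0] > totals[1] / counts[1] else 1
        r = true_means[arm]
        if arm == 0 and counts[0] == 0:
            r = 0.0                           # the adversary's one lie
        totals[arm] += r
        counts[arm] += 1
        pulls[arm] += 1
    return pulls
```

Plain greedy (`floor=0.0`) pulls the good arm exactly once, sees the lie, and abandons it for the rest of the run; with even a 5% floor, one forced revisit reveals the truth and the good arm dominates thereafter.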

Summary

This paper introduces a new way to make decisions in uncertain environments (like your food truck, or a medical trial, or a recommendation engine).

They took a powerful learning tool, added a "stabilizer" (regularization), and proved that this simple tweak solves three huge problems at once:

  1. It lets you learn fast (low regret).
  2. It lets you trust your data (valid inference).
  3. It lets you ignore liars (robustness to corruption).

It's a reminder that sometimes, being a little less "aggressive" and adding a little bit of "caution" actually makes you smarter, more reliable, and tougher.