What Do We Care About in Bandits with Noncompliance? BRACE: Bandits with Recommendations, Abstention, and Certified Effects

This paper introduces BRACE, a parameter-free algorithm for multi-armed bandits with noncompliance. BRACE jointly targets recommendation welfare and treatment learning: it performs certified instrumental-variable inversion only when identification is strong, and otherwise reports honest structural intervals, navigating the trade-off between mediated and direct-control regimes.

Nicolás Della Penna

Published Wed, 11 Ma

Imagine you are a chef running a busy restaurant. Your goal is to make customers happy.

In the old way of doing things (the "Classical Bandit" model), the chef decides the menu, the customer orders exactly what is on the menu, and the chef learns from the customer's reaction. Simple.

But in the real world, things are messier. This is the world of "Noncompliance."

  • The chef (the AI/Algorithm) suggests a dish (a Recommendation).
  • But the customer (the patient, the driver, the user) might ignore the suggestion, swap it for something else, or ask the waiter to change it based on their own secret preferences (the Actual Treatment).

The paper asks a very important question: What should the chef actually be trying to learn?

The Three Different Goals

The paper argues that there are three different ways to measure "success," and they often contradict each other. You have to pick one before you start cooking.

  1. The "Current Reality" Goal (REC):

    • The Question: "How do we make customers happy right now, given that they might ignore us?"
    • The Metaphor: The chef learns to suggest dishes that, even if the customer changes them, still lead to a happy meal. Maybe the chef suggests "Spicy Tacos," knowing the customer will swap them for "Mild Tacos," but the chef knows that specific customer loves mild tacos. The goal is to optimize the suggestion to fit the messy reality.
    • Who cares? The people currently in the restaurant.
  2. The "Ideal Future" Goal (TRT):

    • The Question: "If we could magically force everyone to eat exactly what we tell them, what would be the best menu?"
    • The Metaphor: The chef ignores the fact that customers change orders. They try to learn the "perfect" dish for a hypothetical world where the customer always obeys. This is useful if the chef plans to open a new restaurant where they have total control (like a hospital where a doctor must follow a protocol).
    • Who cares? Future customers, regulators, or scientists who want a "pure" rule.
  3. The "Safety First" Goal (INF):

    • The Question: "Can we be sure we know the answer, or should we just admit we don't know yet?"
    • The Metaphor: If the chef is guessing wildly because the data is confusing, they shouldn't just pick a random dish. They should say, "I'm not sure yet, let's keep experimenting." This prevents the chef from confidently serving a dish that makes people sick.
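The tension between the first two goals can be made concrete with a little arithmetic. The sketch below is a toy model of my own (the numbers, names, and the compliance matrix are illustrative, not from the paper): a compliance matrix describes how a recommendation turns into a dish actually eaten, and the two goals can end up ranking the dishes differently.

```python
import numpy as np

# Toy model (illustrative numbers, not from the paper).
# compliance[z][a] = P(customer actually eats dish a | chef recommends dish z).
# These customers are contrarian: recommending dish 1 drives most of them to dish 0.
compliance = np.array([[0.2, 0.8],
                       [0.6, 0.4]])
# reward[a] = expected happiness if dish a is actually eaten.
reward = np.array([0.9, 0.4])

# Goal 1 (REC): value of a *recommendation*, averaged over noncompliance.
rec_value = compliance @ reward           # [0.50, 0.70]
best_rec = int(np.argmax(rec_value))      # recommend dish 1

# Goal 2 (TRT): value of a *forced treatment*, as if everyone obeyed.
best_trt = int(np.argmax(reward))         # force dish 0

print(best_rec, best_trt)                 # the two goals pick different dishes
```

Dish 0 is the better meal, but recommending dish 1 is the better suggestion, because of how customers react. Optimizing one objective can actively mislead you about the other.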

The Big Problem: They Don't Agree

The paper's biggest discovery is that Goal 1 and Goal 2 often point in opposite directions.

  • The "Private Signal" Analogy: Imagine a customer has a secret allergy (a "private signal") that only they know.
    • If the chef tries to learn the "Ideal Future" rule (Goal 2), they might conclude: "Everyone should eat Steak." (Because in the data, Steak usually works).
    • But if the chef optimizes for the "Current Reality" (Goal 1), they might learn: "When I suggest 'Salad', this specific customer secretly switches to 'Steak' because they know they are allergic to the salad dressing."
    • Result: The best suggestion (Salad) leads to the best outcome, because the customer's switch to Steak uses private information only they have. But naively reading that data as "Salad is the best treatment" would be a disaster: if the customer were forced to actually eat the Salad, without the switch, the allergy would make them sick.
    • The Lesson: Sometimes, the best thing to say is different from the best thing to do. If you try to learn the "do" part while you are only allowed to "say" things, you might fail.
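The allergy story checks out numerically. Below is a tiny worked example with made-up probabilities (mine, not the paper's): a hidden allergy bit, and allergic customers who quietly swap a suggested Salad for Steak.

```python
# Toy numbers (my own, not from the paper): P(allergic) = 0.4.
p = 0.4
# reward[u][dish]: expected happiness if the dish is eaten, given allergy status u.
reward = {0: {"salad": 1.0, "steak": 0.5},   # not allergic: loves the salad
          1: {"salad": 0.0, "steak": 0.8}}   # allergic: salad makes them sick

# Goal 2 (forced treatment): everyone eats the dish, no switching possible.
forced = {d: (1 - p) * reward[0][d] + p * reward[1][d]
          for d in ("salad", "steak")}
# forced["salad"] = 0.60, forced["steak"] = 0.62 -> forcing Steak wins.

# Goal 1 (recommendation): allergic customers quietly swap Salad for Steak.
rec_salad = (1 - p) * reward[0]["salad"] + p * reward[1]["steak"]   # 0.92
rec_steak = (1 - p) * reward[0]["steak"] + p * reward[1]["steak"]   # 0.62

print(forced, rec_salad, rec_steak)
```

Suggesting Salad (0.92) beats every forced menu, because the switch routes each customer to the dish that suits their private allergy status. The best thing to say really is different from the best thing to do.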

The Solution: BRACE (The Smart Chef)

The authors built a new algorithm called BRACE (Bandits with Recommendations, Abstention, and Certified Effects). Think of it as a super-smart kitchen manager with a safety checklist.

  1. It Picks a Goal First: Before it starts, you tell it: "Are we optimizing for the current messy restaurant (REC) or the future perfect one (TRT)?"
  2. It Checks the Math (Certification): Before the chef commits to a new menu, BRACE checks the math. "Do we have enough evidence that this suggestion works?"
    • If the data is shaky (like if the customer's behavior is random), BRACE says: "ABSTAIN." It stops making a decision and keeps experimenting. It would rather be slow than wrong.
    • If the data is strong, it says: "Go!"
  3. It Handles the "Messy" Stuff: If the math is too hard to solve perfectly (because the customer is too unpredictable), BRACE doesn't guess. It gives you a wide range of possibilities: "We are 95% sure the best dish is somewhere between 'Pizza' and 'Pasta'."
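Here is a minimal sketch of what a certify-or-abstain check could look like. Everything below is my own illustration, not the paper's actual algorithm: it uses the classic Wald/IV ratio (the intent-to-treat effect divided by the compliance rate) and abstains when the recommendation barely moves behavior, since dividing by a near-zero compliance rate makes the estimate blow up.

```python
import numpy as np

def certify_or_abstain(recs, actions, rewards, min_compliance=0.1, z=1.96):
    """Certify-or-abstain sketch (hypothetical; names and thresholds are mine).

    Wald/IV ratio: effect = ITT / compliance, with abstention when the
    first stage (how much the recommendation moves behavior) is weak.
    """
    recs, actions, rewards = map(np.asarray, (recs, actions, rewards))
    treated, control = recs == 1, recs == 0
    # First stage: does recommending the dish actually change what gets eaten?
    compliance = actions[treated].mean() - actions[control].mean()
    if abs(compliance) < min_compliance:
        return "ABSTAIN", None        # identification too weak: keep exploring
    # Intent-to-treat effect of the recommendation on reward.
    itt = rewards[treated].mean() - rewards[control].mean()
    effect = itt / compliance
    # Crude interval: normal-approximation ITT error pushed through the ratio.
    se_itt = np.sqrt(rewards[treated].var() / treated.sum()
                     + rewards[control].var() / control.sum())
    half = z * se_itt / abs(compliance)
    return "GO", (effect - half, effect + half)

# Perfect compliance and a clean reward gap: the effect is certified ("GO").
recs = [0] * 50 + [1] * 50
print(certify_or_abstain(recs, recs, recs))
```

If the same rewards arrive but customers ignore the recommendation (actions uncorrelated with `recs`), the compliance term collapses toward zero and the function returns `"ABSTAIN"` instead of a shaky estimate, which is the "rather be slow than wrong" behavior described above.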

Why This Matters in Real Life

This isn't just about math; it's about ethics and design.

  • In Medicine: A doctor's AI might suggest a treatment. If the patient refuses it, the AI shouldn't try to learn "what the patient would have done if forced." It should learn "what to suggest so the patient ends up happy."
  • In Self-Driving Cars: The car suggests a lane change. The human driver might override it. Should the car learn to be a better "suggester" (to work with the human) or a better "driver" (to take over completely)? The paper says: Decide which one you want first.
  • In Business: If you are testing a new app feature, are you trying to improve the current user experience (where users might ignore your prompts), or are you trying to design a future version where you have full control?

The Bottom Line

The paper tells us: Stop trying to be everything at once.

If you are in a world where people have free will (and they do), you have to choose:

  1. Do you want to win today by working with their quirks? (Optimize Recommendations)
  2. Do you want to win tomorrow by designing a system where they have no choice? (Optimize Treatments)
  3. Do you want to be safe and admit when you don't know? (Optimize Inference)

The old way tried to do all three and got confused. The new way (BRACE) says: "Pick your lane, and I'll drive you there safely, even if it means I have to stop and say 'I don't know' sometimes."