What Do We Care About in Bandits with Noncompliance? BRACE: Bandits with Recommendations, Abstention, and Certified Effects

This paper introduces BRACE, a parameter-free algorithm for multi-armed bandits with noncompliance. BRACE jointly targets recommendation welfare and treatment learning: it performs certified instrumental-variable inversion only when identification is strong, and otherwise reports honest structural intervals, navigating the trade-off between mediated and direct-control regimes.

Nicolás Della Penna

Published Wed, 11 Ma

Imagine you are a chef running a busy restaurant. Your goal is to make customers happy.

In the old way of doing things (the "Classical Bandit" model), the chef decides the menu, the customer orders exactly what is on the menu, and the chef learns from the customer's reaction. Simple.

But in the real world, things are messier. This is the world of "Noncompliance."

  • The chef (the AI/Algorithm) suggests a dish (a Recommendation).
  • But the customer (the patient, the driver, the user) might ignore the suggestion, swap it for something else, or ask the waiter to change it based on their own secret preferences (the Actual Treatment).

The paper asks a very important question: What should the chef actually be trying to learn?

The Three Different Goals

The paper argues that there are three different ways to measure "success," and they often contradict each other. You have to pick one before you start cooking.

  1. The "Current Reality" Goal (REC):

    • The Question: "How do we make customers happy right now, given that they might ignore us?"
    • The Metaphor: The chef learns to suggest dishes that, even if the customer changes them, still lead to a happy meal. Maybe the chef suggests "Spicy Tacos," knowing the customer will swap them for "Mild Tacos," but the chef knows that specific customer loves mild tacos. The goal is to optimize the suggestion to fit the messy reality.
    • Who cares? The people currently in the restaurant.
  2. The "Ideal Future" Goal (TRT):

    • The Question: "If we could magically force everyone to eat exactly what we tell them, what would be the best menu?"
    • The Metaphor: The chef ignores the fact that customers change orders. They try to learn the "perfect" dish for a hypothetical world where the customer always obeys. This is useful if the chef plans to open a new restaurant where they have total control (like a hospital where a doctor must follow a protocol).
    • Who cares? Future customers, regulators, or scientists who want a "pure" rule.
  3. The "Safety First" Goal (INF):

    • The Question: "Can we be sure we know the answer, or should we just admit we don't know yet?"
    • The Metaphor: If the chef is guessing wildly because the data is confusing, they shouldn't just pick a random dish. They should say, "I'm not sure yet, let's keep experimenting." This prevents the chef from confidently serving a dish that makes people sick.
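The tension between the first two goals can be made concrete with a little arithmetic. The sketch below is a toy model of my own (the numbers, names, and the compliance matrix are illustrative, not from the paper): a compliance matrix describes how a recommendation turns into a dish actually eaten, and the two goals can end up ranking the dishes differently.

```python
import numpy as np

# Toy model (illustrative numbers, not from the paper).
# compliance[z][a] = P(customer actually eats dish a | chef recommends dish z).
# These customers are contrarian: recommending dish 1 drives most of them to dish 0.
compliance = np.array([[0.2, 0.8],
                       [0.6, 0.4]])
# reward[a] = expected happiness if dish a is actually eaten.
reward = np.array([0.9, 0.4])

# Goal 1 (REC): value of a *recommendation*, averaged over noncompliance.
rec_value = compliance @ reward           # [0.50, 0.70]
best_rec = int(np.argmax(rec_value))      # recommend dish 1

# Goal 2 (TRT): value of a *forced treatment*, as if everyone obeyed.
best_trt = int(np.argmax(reward))         # force dish 0

print(best_rec, best_trt)                 # the two goals pick different dishes
```

Dish 0 is the better meal, but recommending dish 1 is the better suggestion, because of how customers react. Optimizing one objective can actively mislead you about the other.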

The Big Problem: They Don't Agree

The paper's biggest discovery is that Goal 1 and Goal 2 often point in opposite directions.

  • The "Private Signal" Analogy: Imagine a customer has a secret allergy (a "private signal") that only they know.
    • If the chef tries to learn the "Ideal Future" rule (Goal 2), they might conclude: "Everyone should eat Steak." (Because in the data, Steak usually works).
    • But if the chef optimizes for the "Current Reality" (Goal 1), they might learn: "When I suggest 'Salad', this specific customer secretly switches to 'Steak' because they know they are allergic to the salad dressing."
    • Result: The best suggestion (Salad) leads to the best outcome, because the customer's switch to Steak uses private information only they have. But naively reading that data as "Salad is the best treatment" would be a disaster: if the customer were forced to actually eat the Salad, without the switch, the allergy would make them sick.
    • The Lesson: Sometimes, the best thing to say is different from the best thing to do. If you try to learn the "do" part while you are only allowed to "say" things, you might fail.
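The allergy story checks out numerically. Below is a tiny worked example with made-up probabilities (mine, not the paper's): a hidden allergy bit, and allergic customers who quietly swap a suggested Salad for Steak.

```python
# Toy numbers (my own, not from the paper): P(allergic) = 0.4.
p = 0.4
# reward[u][dish]: expected happiness if the dish is eaten, given allergy status u.
reward = {0: {"salad": 1.0, "steak": 0.5},   # not allergic: loves the salad
          1: {"salad": 0.0, "steak": 0.8}}   # allergic: salad makes them sick

# Goal 2 (forced treatment): everyone eats the dish, no switching possible.
forced = {d: (1 - p) * reward[0][d] + p * reward[1][d]
          for d in ("salad", "steak")}
# forced["salad"] = 0.60, forced["steak"] = 0.62 -> forcing Steak wins.

# Goal 1 (recommendation): allergic customers quietly swap Salad for Steak.
rec_salad = (1 - p) * reward[0]["salad"] + p * reward[1]["steak"]   # 0.92
rec_steak = (1 - p) * reward[0]["steak"] + p * reward[1]["steak"]   # 0.62

print(forced, rec_salad, rec_steak)
```

Suggesting Salad (0.92) beats every forced menu, because the switch routes each customer to the dish that suits their private allergy status. The best thing to say really is different from the best thing to do.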

The Solution: BRACE (The Smart Chef)

The authors built a new algorithm called BRACE (Bandits with Recommendations, Abstention, and Certified Effects). Think of it as a super-smart kitchen manager with a safety checklist.

  1. It Picks a Goal First: Before it starts, you tell it: "Are we optimizing for the current messy restaurant (REC) or the future perfect one (TRT)?"
  2. It Checks the Math (Certification): Before the chef commits to a new menu, BRACE checks the math. "Do we have enough evidence that this suggestion works?"
    • If the data is shaky (like if the customer's behavior is random), BRACE says: "ABSTAIN." It stops making a decision and keeps experimenting. It would rather be slow than wrong.
    • If the data is strong, it says: "Go!"
  3. It Handles the "Messy" Stuff: If the math is too hard to solve perfectly (because the customer is too unpredictable), BRACE doesn't guess. It gives you a wide range of possibilities: "We are 95% sure the best dish is somewhere between 'Pizza' and 'Pasta'."
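Here is a minimal sketch of what a certify-or-abstain check could look like. Everything below is my own illustration, not the paper's actual algorithm: it uses the classic Wald/IV ratio (the intent-to-treat effect divided by the compliance rate) and abstains when the recommendation barely moves behavior, since dividing by a near-zero compliance rate makes the estimate blow up.

```python
import numpy as np

def certify_or_abstain(recs, actions, rewards, min_compliance=0.1, z=1.96):
    """Certify-or-abstain sketch (hypothetical; names and thresholds are mine).

    Wald/IV ratio: effect = ITT / compliance, with abstention when the
    first stage (how much the recommendation moves behavior) is weak.
    """
    recs, actions, rewards = map(np.asarray, (recs, actions, rewards))
    treated, control = recs == 1, recs == 0
    # First stage: does recommending the dish actually change what gets eaten?
    compliance = actions[treated].mean() - actions[control].mean()
    if abs(compliance) < min_compliance:
        return "ABSTAIN", None        # identification too weak: keep exploring
    # Intent-to-treat effect of the recommendation on reward.
    itt = rewards[treated].mean() - rewards[control].mean()
    effect = itt / compliance
    # Crude interval: normal-approximation ITT error pushed through the ratio.
    se_itt = np.sqrt(rewards[treated].var() / treated.sum()
                     + rewards[control].var() / control.sum())
    half = z * se_itt / abs(compliance)
    return "GO", (effect - half, effect + half)

# Perfect compliance and a clean reward gap: the effect is certified ("GO").
recs = [0] * 50 + [1] * 50
print(certify_or_abstain(recs, recs, recs))
```

If the same rewards arrive but customers ignore the recommendation (actions uncorrelated with `recs`), the compliance term collapses toward zero and the function returns `"ABSTAIN"` instead of a shaky estimate, which is the "rather be slow than wrong" behavior described above.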

Why This Matters in Real Life

This isn't just about math; it's about ethics and design.

  • In Medicine: A doctor's AI might suggest a treatment. If the patient refuses it, the AI shouldn't try to learn "what the patient would have done if forced." It should learn "what to suggest so the patient ends up happy."
  • In Self-Driving Cars: The car suggests a lane change. The human driver might override it. Should the car learn to be a better "suggester" (to work with the human) or a better "driver" (to take over completely)? The paper says: Decide which one you want first.
  • In Business: If you are testing a new app feature, are you trying to improve the current user experience (where users might ignore your prompts), or are you trying to design a future version where you have full control?

The Bottom Line

The paper tells us: Stop trying to be everything at once.

If you are in a world where people have free will (and they do), you have to choose:

  1. Do you want to win today by working with their quirks? (Optimize Recommendations)
  2. Do you want to win tomorrow by designing a system where they have no choice? (Optimize Treatments)
  3. Do you want to be safe and admit when you don't know? (Optimize Inference)

The old way tried to do all three and got confused. The new way (BRACE) says: "Pick your lane, and I'll drive you there safely, even if it means I have to stop and say 'I don't know' sometimes."