Imagine you have a very smart, but sometimes overconfident, robot assistant. You ask it questions, and it usually gives great answers. But sometimes, when it's not sure, it just makes something up (a "hallucination"). In high-stakes situations—like medical advice or legal research—getting a wrong answer is dangerous.
To fix this, we usually tell the robot: "If you aren't sure enough, just say 'I don't know'." This is called Selective Generation. The goal is to keep the robot's "False Discovery Rate" (FDR) low. Think of FDR as the percentage of the answers the robot actually gives that turn out to be wrong. We want to keep this number below a safe limit (say, 5%).
The Problem: The Robot is Playing a Game in the Dark
In the real world, we don't get a perfect scorecard after every answer. We don't get a "Correct" or "Incorrect" label immediately. Instead, we get partial feedback, like a user giving a "thumbs up" or "thumbs down."
Even worse, the environment can be tricky. The questions might change topics suddenly (like switching from cooking to quantum physics), or a user might deliberately try to trip the robot up with adversarial questions (an "adversary").
Existing methods for teaching robots in these conditions are either too slow, require perfect scorecards (which we don't have), or break down when the questions change.
The Solution: ExSUL (The "Feedback Unlocking" Detective)
The authors of this paper propose a new method called ExSUL. They treat the problem like a game of Multi-Armed Bandits (imagine a row of slot machines).
- The Slot Machines: Each "machine" is a different setting for the robot's "caution level" (a threshold).
- Machine A: "Answer everything, even if I'm 10% sure." (High risk, high reward).
- Machine B: "Only answer if I'm 99% sure." (Low risk, low reward).
- Machine C: "Only answer if I'm 50% sure."
- The Goal: The robot needs to figure out which "machine" (caution level) is the best one to use right now to keep the error rate low while still answering enough questions to be useful.
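To make the "caution level" idea concrete, here is a minimal sketch in Python. The `confidence` score and the `selective_answer` helper are illustrative names, not the paper's actual code: each threshold value plays the role of one "slot machine."

```python
# Minimal sketch of threshold-based selective generation.
# `confidence` is a hypothetical score in [0, 1] from the model;
# the threshold is the "caution level" (one slot machine per value).

def selective_answer(answer: str, confidence: float, threshold: float) -> str:
    """Return the answer only if the model is confident enough."""
    if confidence >= threshold:
        return answer          # risk being wrong, but stay useful
    return "I don't know"      # abstain: safe but unhelpful

# Machine A (low caution) answers; Machine B (high caution) abstains.
print(selective_answer("Paris", confidence=0.6, threshold=0.10))  # Paris
print(selective_answer("Paris", confidence=0.6, threshold=0.99))  # I don't know
```

The bandit problem is then: pick which threshold to use on each question, learning from thumbs up/down which one keeps errors low without abstaining too often.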
The Magic Trick: "Feedback Unlocking"
Here is the clever part. In a normal slot machine game, if you pull the lever on Machine A, you only find out if Machine A won or lost. You learn nothing about Machines B, C, or D. This makes learning very slow.
But in this specific robot game, the rules are special. The "caution levels" are arranged on a line (from 0% to 100%).
- If the robot answers a question at a high caution level (e.g., 90%), its confidence cleared that strict bar, so it would also have answered at any lower caution level (e.g., 50%).
- If the robot says "I don't know" at a low caution level (e.g., 20%), its confidence fell below even that lenient bar, so it would also have said "I don't know" at any higher level (e.g., 60%).
ExSUL uses a technique called "Feedback Unlocking."
Imagine you pull the lever on Machine A and get a "Thumbs Up." Because of the special rules of the game, ExSUL realizes: "Wait a minute! Every more cautious machine that would still have answered this question would have earned the same thumbs up, and every machine cautious enough to abstain would have said 'I don't know,' an outcome I also know without playing it!"
It effectively unlocks information about all the other machines just by playing one. This allows the robot to learn much faster than before, even with only partial "thumbs up/down" feedback.
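A minimal sketch of the unlocking idea, assuming the model's confidence score on the question is observable (function and variable names here are illustrative, not the paper's implementation): one observed thumbs up/down determines the outcome for every caution level at once.

```python
# Sketch of "feedback unlocking" under the monotone threshold structure.
# One observed thumbs-up/down reveals the outcome for EVERY caution level:
# arms with threshold <= confidence would have emitted the same answer
# (same feedback); arms above it would have abstained (a known outcome).

def unlock_feedback(confidence: float, thumbs_up: bool, thresholds) -> dict:
    """Map one observed feedback signal to per-arm outcomes."""
    outcomes = {}
    for t in thresholds:
        if t <= confidence:
            outcomes[t] = "correct" if thumbs_up else "wrong"
        else:
            outcomes[t] = "abstained"  # known without pulling this arm
    return outcomes

arms = [0.1, 0.5, 0.9]
print(unlock_feedback(confidence=0.6, thumbs_up=True, thresholds=arms))
# every arm's outcome is learned from a single pull
```

This is why learning is fast: instead of one data point per question, the robot effectively gets one data point per arm per question.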
The "Regret-to-FDR" Translator
The paper also introduces a mathematical "translator." Usually, computer scientists measure success by "Regret" (how much worse the robot did compared to the best caution level in hindsight). But we care about "FDR" (the error rate).
The authors proved a new rule: If you minimize Regret, you automatically control the FDR.
Think of it like a speedometer and a fuel gauge. Usually, they measure different things. But the authors found a special car where if you keep the speedometer (Regret) low, the fuel gauge (FDR) is guaranteed to stay in the green zone. This means they can use standard, powerful learning algorithms and know for a fact that the robot won't lie too often.
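For concreteness, here is a tiny sketch of the quantity on the "fuel gauge" side: the empirical FDR. The helper name and outcome labels are illustrative assumptions, not the paper's notation.

```python
# Sketch of the quantity being controlled: empirical False Discovery
# Rate (FDR) = fraction of *emitted* answers that are wrong.
# Abstentions don't count against FDR, but they cost usefulness.

def empirical_fdr(outcomes) -> float:
    """outcomes: list of 'correct', 'wrong', or 'abstained'."""
    answered = [o for o in outcomes if o != "abstained"]
    if not answered:
        return 0.0  # no discoveries means no false discoveries
    return sum(o == "wrong" for o in answered) / len(answered)

history = ["correct", "wrong", "abstained", "correct", "correct"]
print(empirical_fdr(history))  # 1 wrong out of 4 emitted answers = 0.25
```

The regret-to-FDR result says that any algorithm driving regret down also keeps this ratio below the target limit, so off-the-shelf bandit algorithms inherit the safety guarantee.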
The Results: A Robot That Knows Its Limits
The team tested ExSUL with real Large Language Models (like GPT-3.5 and LLaMA) in four different worlds:
- Steady World: Questions stay the same.
- Shifting World: Questions suddenly change topics.
- Chat World: A back-and-forth conversation.
- Tricky World: An "adversary" tries to trick the robot into lying.
The Verdict:
- ExSUL kept the error rate (FDR) strictly under the limit (e.g., below 8%) in all scenarios.
- It didn't just say "I don't know" to everything (which would be safe but useless). It kept answering questions confidently when it was right.
- Other methods either lied too much or stopped answering too often.
Summary Analogy
Imagine you are a bouncer at a club.
- The Old Way: You guess who to let in based on a gut feeling. Sometimes you let in troublemakers (hallucinations), sometimes you kick out good people (inefficiency).
- The ExSUL Way: You have a special training system. Every time you let someone in or out, you get a "thumbs up/down" from the crowd.
- If you let in a guy in a suit and he's cool, you instantly know that anyone in a suit would have been cool too (Feedback Unlocking).
- You adjust your bouncer rules instantly to keep the club safe (Control FDR) without turning away every single person (Maximize Efficiency).
This paper gives AI systems a way to be humble (admitting when they don't know) and smart (learning quickly from limited feedback), making them much safer for real-world use.