Thompson Sampling via Fine-Tuning of LLMs

The paper introduces Thompson Sampling via Fine-Tuning (ToSFiT), a scalable Bayesian optimization method that leverages fine-tuned large language models to directly parameterize the probability of candidate optimality, thereby eliminating the need for costly acquisition function maximization while achieving state-of-the-art sample and computational efficiency across diverse discrete search tasks.

Nicolas Menet, Aleksandar Terzić, Michael Hersche, Andreas Krause, Abbas Rahimi

Published 2026-03-02

Imagine you are a treasure hunter trying to find the single best spot to dig for gold in a massive, uncharted jungle. The jungle is so huge that you can't possibly check every single square inch (that's the "large unstructured discrete space"). You have a map, but it's blurry and incomplete.

Traditionally, treasure hunters use a method called Bayesian Optimization. They build a model of the jungle, guess where the gold might be, and then try to solve a complex math puzzle to find the absolute best spot to dig next. The problem? In a jungle with no clear paths (no "gradients"), solving that puzzle is like trying to find a needle in a haystack by checking every single piece of hay one by one. It takes forever.

This paper introduces a new method called ToSFiT (Thompson Sampling via Fine-Tuning). Here's how it works, using simple analogies:

1. The Old Way: The Exhaustive Search

Imagine you have a super-smart robot that can predict where gold is. To find the best spot, the old method asks the robot: "Okay, based on what I know, show me the ONE perfect spot to dig."
The robot has to look at millions of possibilities, calculate the odds for each, and pick the winner. In a complex jungle, this takes so long that you run out of time before you even dig once.

2. The New Way: The Intuitive Guide (ToSFiT)

Instead of asking the robot to solve a math puzzle to find the one best spot, ToSFiT changes the game. It treats the robot like a creative writer (a Large Language Model) who already knows a lot about the world.

  • The Starting Point (Pre-training): Imagine the robot is a seasoned explorer who has read every travel guide ever written. It already has a "gut feeling" about where gold usually hides. We don't start from scratch; we start with this expert's intuition.
  • The Process (Fine-Tuning): As you dig and find gold (or dirt), you don't ask the robot to re-calculate the whole map. Instead, you gently teach the robot. You say, "Hey, you guessed 'Spot A' was good, but we found gold at 'Spot B'. Next time, lean a little more toward Spot B."
  • The Magic (Thompson Sampling): The robot doesn't just pick the "best" spot. It generates a few different ideas for where to dig next, based on its updated "gut feeling." Some ideas are safe bets (exploitation), and some are wild guesses in new areas (exploration). This happens naturally, without needing to solve a hard math problem.
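The core idea of that last step — sample candidates from the model's belief instead of exhaustively computing an argmax — can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the candidate space, logits, and sample size `k` are all made up, and a softmax over explicit logits stands in for the fine-tuned LLM's generative distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "jungle": 1,000 discrete candidate dig sites.
NUM_CANDIDATES = 1000

# The model's current belief over candidates, playing the role of the
# fine-tuned LLM's distribution (uniform here, for illustration).
logits = np.zeros(NUM_CANDIDATES)

def sample_candidates(logits, k=4):
    """Thompson-style proposal: draw k candidates from the belief
    distribution. Some draws land on high-probability "safe bets"
    (exploitation), others on long shots (exploration) -- no argmax
    over the whole space is ever computed."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(logits), size=k, replace=False, p=probs)

picks = sample_candidates(logits)
```

Because proposing a candidate is just one draw from the model, the cost per step is a single forward sample rather than an enumeration of millions of possibilities.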

3. The "Regret" Problem: Why Careful Teaching Matters

The paper makes a crucial discovery about how you teach the robot.

  • The "Careless" Teacher: If you yell at the robot or change its mind too drastically after one mistake, it forgets everything it learned from the travel guides. It becomes confused and starts digging in random holes.
  • The "Careful" Teacher (ToSFiT): ToSFiT teaches the robot gently. It keeps the robot's original "gut feeling" (the pre-training) but slowly nudges it toward the new evidence. This ensures the robot stays smart and doesn't forget the basics while learning the new specifics.
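One way to picture "gentle teaching" in code: take a small gradient step that raises the probability of the newly discovered good candidate, while an anchor term pulls the model back toward its pretrained prior so the travel-guide knowledge isn't erased. This is a hedged sketch, not the paper's training objective: the logits, learning rate `lr`, anchor weight `kl_weight`, and the quadratic stand-in for a KL penalty are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
prior_logits = rng.normal(size=n)   # the "travel-guide" intuition (pretraining)
logits = prior_logits.copy()        # current model, before any digging

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gentle_update(logits, prior_logits, best_idx, lr=0.5, kl_weight=0.3):
    """One 'careful teacher' step. The first term is the gradient of
    log p(best_idx), nudging belief toward the new evidence; the second
    pulls the logits back toward the pretrained prior (a crude quadratic
    stand-in for a KL-regularization term). A careless teacher would set
    kl_weight = 0 and risk forgetting everything the prior knew."""
    p = softmax(logits)
    grad_ll = -p
    grad_ll[best_idx] += 1.0              # d/dlogits of log p(best_idx)
    grad_anchor = prior_logits - logits   # stay close to the prior
    return logits + lr * (grad_ll + kl_weight * grad_anchor)

# We dug at a few spots and found gold at candidate 7: nudge toward it.
new_logits = gentle_update(logits, prior_logits, best_idx=7)
```

After the step, candidate 7 is more probable than before, but the overall shape of the distribution still resembles the prior — the robot learned the new specific without forgetting the basics.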

4. Real-World Examples

The authors tested this on three very different "jungles":

  • FAQ Refinement: Teaching an AI to write better answers to customer questions.
  • Protein Search: Finding the perfect sequence of amino acids to create a super-stable protein (like finding a needle in a universe-sized haystack).
  • Quantum Circuit Design: Designing complex code for quantum computers.

In all three cases, ToSFiT found the best solutions faster and with fewer attempts than other methods. It was also computationally cheaper because it skipped the heavy math of the old methods.

The Big Picture

Think of ToSFiT as upgrading from a calculator to a mentor.

  • Old Method: "Calculate the exact probability of every single outcome and pick the winner." (Slow, expensive, and intractable in large spaces.)
  • ToSFiT: "Here is an expert who knows the basics. Let's show them a few examples, let them adjust their intuition, and ask them to suggest a few good ideas." (Fast, scalable, and smart.)

By combining the vast knowledge of AI models with a smart, gentle learning process, ToSFiT solves problems that were previously too big to tackle efficiently. It's like giving a treasure hunter a compass that learns as they walk, rather than forcing them to map the whole world before taking a single step.
