Thompson Sampling via Fine-Tuning of LLMs

The paper introduces Thompson Sampling via Fine-Tuning (ToSFiT), a scalable Bayesian optimization method that leverages fine-tuned large language models to directly parameterize the probability of candidate optimality, thereby eliminating the need for costly acquisition function maximization while achieving state-of-the-art sample and computational efficiency across diverse discrete search tasks.

Nicolas Menet, Aleksandar Terzić, Michael Hersche, Andreas Krause, Abbas Rahimi

Published 2026-03-02

Imagine you are a treasure hunter trying to find the single best spot to dig for gold in a massive, uncharted jungle. The jungle is so huge that you can't possibly check every single square inch (that's the "large unstructured discrete space"). You have a map, but it's blurry and incomplete.

Traditionally, treasure hunters use a method called Bayesian Optimization. They build a model of the jungle, guess where the gold might be, and then try to solve a complex math puzzle to find the absolute best spot to dig next. The problem? In a jungle with no clear paths (no "gradients"), solving that puzzle is like trying to find a needle in a haystack by checking every single piece of hay one by one. It takes forever.

This paper introduces a new method called ToSFiT (Thompson Sampling via Fine-Tuning). Here's how it works, using simple analogies:

1. The Old Way: The Exhaustive Search

Imagine you have a super-smart robot that can predict where gold is. To find the best spot, the old method asks the robot: "Okay, based on what I know, show me the ONE perfect spot to dig."
The robot has to look at millions of possibilities, calculate the odds for each, and pick the winner. In a complex jungle, this takes so long that you run out of time before you even dig once.

2. The New Way: The Intuitive Guide (ToSFiT)

Instead of asking the robot to solve a math puzzle to find the one best spot, ToSFiT changes the game. It treats the robot like a creative writer (a Large Language Model) who already knows a lot about the world.

  • The Starting Point (Pre-training): Imagine the robot is a seasoned explorer who has read every travel guide ever written. It already has a "gut feeling" about where gold usually hides. We don't start from scratch; we start with this expert's intuition.
  • The Process (Fine-Tuning): As you dig and find gold (or dirt), you don't ask the robot to re-calculate the whole map. Instead, you gently teach the robot. You say, "Hey, you guessed 'Spot A' was good, but we found gold at 'Spot B'. Next time, lean a little more toward Spot B."
  • The Magic (Thompson Sampling): The robot doesn't just pick the "best" spot. It generates a few different ideas for where to dig next, based on its updated "gut feeling." Some ideas are safe bets (exploitation), and some are wild guesses in new areas (exploration). This happens naturally, without needing to solve a hard math problem.
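The core idea of that last step — sample candidates from the model's belief instead of exhaustively computing an argmax — can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the candidate space, logits, and sample size `k` are all made up, and a softmax over explicit logits stands in for the fine-tuned LLM's generative distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "jungle": 1,000 discrete candidate dig sites.
NUM_CANDIDATES = 1000

# The model's current belief over candidates, playing the role of the
# fine-tuned LLM's distribution (uniform here, for illustration).
logits = np.zeros(NUM_CANDIDATES)

def sample_candidates(logits, k=4):
    """Thompson-style proposal: draw k candidates from the belief
    distribution. Some draws land on high-probability "safe bets"
    (exploitation), others on long shots (exploration) -- no argmax
    over the whole space is ever computed."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(logits), size=k, replace=False, p=probs)

picks = sample_candidates(logits)
```

Because proposing a candidate is just one draw from the model, the cost per step is a single forward sample rather than an enumeration of millions of possibilities.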

3. The "Regret" Problem: Why Careful Teaching Matters

The paper makes a crucial discovery about how you teach the robot.

  • The "Careless" Teacher: If you yell at the robot or change its mind too drastically after one mistake, it forgets everything it learned from the travel guides. It becomes confused and starts digging in random holes.
  • The "Careful" Teacher (ToSFiT): ToSFiT teaches the robot gently. It keeps the robot's original "gut feeling" (the pre-training) but slowly nudges it toward the new evidence. This ensures the robot stays smart and doesn't forget the basics while learning the new specifics.
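One way to picture "gentle teaching" in code: take a small gradient step that raises the probability of the newly discovered good candidate, while an anchor term pulls the model back toward its pretrained prior so the travel-guide knowledge isn't erased. This is a hedged sketch, not the paper's training objective: the logits, learning rate `lr`, anchor weight `kl_weight`, and the quadratic stand-in for a KL penalty are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
prior_logits = rng.normal(size=n)   # the "travel-guide" intuition (pretraining)
logits = prior_logits.copy()        # current model, before any digging

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gentle_update(logits, prior_logits, best_idx, lr=0.5, kl_weight=0.3):
    """One 'careful teacher' step. The first term is the gradient of
    log p(best_idx), nudging belief toward the new evidence; the second
    pulls the logits back toward the pretrained prior (a crude quadratic
    stand-in for a KL-regularization term). A careless teacher would set
    kl_weight = 0 and risk forgetting everything the prior knew."""
    p = softmax(logits)
    grad_ll = -p
    grad_ll[best_idx] += 1.0              # d/dlogits of log p(best_idx)
    grad_anchor = prior_logits - logits   # stay close to the prior
    return logits + lr * (grad_ll + kl_weight * grad_anchor)

# We dug at a few spots and found gold at candidate 7: nudge toward it.
new_logits = gentle_update(logits, prior_logits, best_idx=7)
```

After the step, candidate 7 is more probable than before, but the overall shape of the distribution still resembles the prior — the robot learned the new specific without forgetting the basics.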

4. Real-World Examples

The authors tested this on three very different "jungles":

  • FAQ Refinement: Teaching an AI to write better answers to customer questions.
  • Protein Search: Finding the perfect sequence of amino acids to create a super-stable protein (like finding a needle in a universe-sized haystack).
  • Quantum Circuit Design: Designing complex code for quantum computers.

In all three cases, ToSFiT found the best solutions faster and with fewer attempts than other methods. It was also computationally cheaper because it skipped the heavy math of the old methods.

The Big Picture

Think of ToSFiT as upgrading from a calculator to a mentor.

  • Old Method: "Calculate the exact probability of every single outcome and pick the winner." (Slow, expensive, and intractable in large spaces.)
  • ToSFiT: "Here is an expert who knows the basics. Let's show them a few examples, let them adjust their intuition, and ask them to suggest a few good ideas." (Fast, scalable, and smart.)

By combining the vast knowledge of AI models with a smart, gentle learning process, ToSFiT solves problems that were previously too big to tackle efficiently. It's like giving a treasure hunter a compass that learns as they walk, rather than forcing them to map the whole world before taking a single step.
