Rigidity in LLM Bandits with Implications for Human-AI Dyads

This paper demonstrates that large language models exhibit robust decision biases in two-arm bandit tasks: stubborn exploitation and low learning rates that persist across decoding parameters, posing significant challenges for optimal human-AI collaboration.

Haomiaomiao Wang, Tomás E Ward, Lili Zhang

Published Tue, 10 Ma

Here is an explanation of the paper, translated into everyday language with some creative analogies.

The Big Idea: AI is a Stubborn Explorer

Imagine you are playing a video game where you have to choose between two treasure chests, Chest X and Chest Y. You don't know which one has the gold, so you have to try them out to learn.

This paper asks a simple but scary question: If we put a Large Language Model (like the AI you chat with) in this game, will it act like a smart learner, or will it get stuck in a rut?

The researchers found that these AIs are incredibly stubborn. Once they make a guess, they rarely change their mind, even when the evidence suggests they should. They treat a tiny, accidental hint as a giant rule, and they refuse to double-check their work.


The Experiment: The "Space Explorer" Game

The researchers turned three popular AIs (DeepSeek, GPT-4.1, and Gemini) into "space explorers."

  • The Setup: They told the AI, "You are a space explorer. Visit Planet X or Planet Y to find gold coins. You don't know which planet has more gold yet."
  • The Rules: They ran this game 20,000 times (200 simulations × 100 rounds each) under different settings.
    • Scenario A (The Coin Flip): Both planets had an equal chance of having gold. A smart player should switch back and forth to see what happens.
    • Scenario B (The Cheat Code): One planet had gold 75% of the time, and the other only 25%. A smart player should find the good one and stick with it, but occasionally check the other one just in case.
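The two scenarios amount to a simple Bernoulli two-arm bandit. Here is a minimal sketch of Scenario B (the 75%/25% split is from the paper; the function names and the epsilon-greedy "smart player" baseline are my own illustrations, not the paper's code):

```python
import random

def pull(planet, rng, p_gold={"X": 0.75, "Y": 0.25}):
    """One visit: returns 1 gold coin with the planet's probability."""
    return 1 if rng.random() < p_gold[planet] else 0

def run_episode(policy, rounds=100, seed=0):
    """Run one 100-round episode, feeding the policy its own history."""
    rng = random.Random(seed)
    history = []  # list of (planet, reward) pairs
    for _ in range(rounds):
        planet = policy(history)
        history.append((planet, pull(planet, rng)))
    return history

def eps_greedy(history, eps=0.1, rng=random.Random(1)):
    """A 'smart player' baseline: mostly exploit, but explore 10% of the time."""
    if not history or rng.random() < eps:
        return rng.choice(["X", "Y"])
    means = {}
    for arm in ("X", "Y"):
        rewards = [r for a, r in history if a == arm]
        means[arm] = sum(rewards) / len(rewards) if rewards else 0.0
    return max(means, key=means.get)

hist = run_episode(eps_greedy)
total_gold = sum(r for _, r in hist)
```

The point of the baseline: even a very simple policy keeps occasionally checking the other planet, which is exactly the behavior the paper reports the LLMs lack.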

What Happened? (The Results)

1. The "First Impression" Trap (Symmetric Rewards)

In the fair game (where both planets were equal), the AIs didn't act fairly.

  • The Analogy: Imagine you walk into a room with two identical doors. You happen to push the left one first, and it opens. You decide, "Aha! The left door is the magic door!" and you spend the next 99 tries pushing the left door, ignoring the right one completely.
  • The Reality: The AIs picked the first option (Planet X) almost immediately. Even though they got no extra gold for doing so, they stuck to that choice stubbornly. They amplified a tiny, random "nudge" into a rigid rule.

2. The "One-Track Mind" (Asymmetric Rewards)

In the unfair game (where one planet was clearly better), the AIs found the better planet quickly.

  • The Analogy: Imagine you find a vending machine that gives you a candy 75% of the time. A smart human would press that button, but maybe try the other button once in a while just to be sure the machine hasn't changed.
  • The Reality: The AIs found the better planet and never tried the other one again. They became "rigid." They exploited the good option so hard that they missed out on small opportunities to verify their strategy. They were so confident they stopped learning.

The "Secret Sauce" (Why does this happen?)

The researchers used a mathematical model (like a detective looking at footprints) to figure out why the AIs acted this way. They found two main "personality traits" in the AI's brain:

  1. Slow Learner (Low Learning Rate): The AI is slow to update its beliefs. If it thinks "Planet X is good," it takes a lot of evidence to convince it that "Actually, Planet Y might be better."
  2. Over-Confident (High Inverse Temperature): This is the most important part. The AI is too certain. It acts like a robot that has decided, "I am 100% sure," rather than a human who says, "I'm pretty sure, but let me check."

The Analogy: Imagine a driver who sees a green light. A normal driver goes. A "rigid" driver sees the green light, decides "Green means Go," and then drives through the intersection at 100 mph even if a red light appears 2 seconds later, because they are too locked into their initial decision to brake.
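The two "personality traits" map onto the two parameters of the standard computational model used for this kind of analysis: a Q-learning update with learning rate alpha and a softmax choice rule with inverse temperature beta. A minimal sketch (the specific parameter values are illustrative only, not the paper's fitted estimates):

```python
import math
import random

def simulate_q_learner(alpha, beta, p_gold=(0.75, 0.25), rounds=100, seed=0):
    """Q-learning with softmax choice on a two-arm bandit.

    alpha: learning rate -- how fast beliefs update after each reward.
    beta:  inverse temperature -- how deterministically the agent
           exploits its current beliefs (high beta = rigid).
    """
    rng = random.Random(seed)
    q = [0.0, 0.0]  # value estimates for the two planets
    choices = []
    for _ in range(rounds):
        # Softmax choice rule: P(arm 0) = 1 / (1 + exp(-beta * (q0 - q1)))
        p0 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
        arm = 0 if rng.random() < p0 else 1
        reward = 1 if rng.random() < p_gold[arm] else 0
        # Delta-rule update: nudge the estimate toward the observed reward.
        q[arm] += alpha * (reward - q[arm])
        choices.append(arm)
    return choices

# Illustrative settings: a "rigid" agent (slow learner, over-confident) ...
rigid = simulate_q_learner(alpha=0.05, beta=20.0)
# ... versus a more exploratory one.
flexible = simulate_q_learner(alpha=0.3, beta=3.0)
```

With a high beta, even a tiny gap between the two value estimates gets amplified into a near-certain choice, which is the mathematical form of the "stubbornness" described above.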

Does Changing the Settings Help?

The researchers tried to "fix" the AI by changing its settings (like turning up the "temperature" to make it more random or creative).

  • The Result: It didn't really work. Turning up the "creativity" knob just made the AI make more random mistakes (like typing the wrong letter), but it didn't make the AI smarter or more willing to explore. The underlying stubbornness remained.
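A quick numerical sketch of why the temperature knob can't undo the rigidity: decoding temperature divides the model's logits before the softmax, so when the internal preference is strong, even doubling the temperature barely flattens the choice distribution (the logit values here are illustrative, not taken from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a decoding-style temperature knob."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# A strong internal preference for "Planet X" over "Planet Y".
logits = [6.0, 0.0]

p_low = softmax(logits, temperature=0.5)[0]   # ~0.99999
p_mid = softmax(logits, temperature=1.0)[0]   # ~0.998
p_high = softmax(logits, temperature=2.0)[0]  # ~0.953
```

Even at temperature 2.0 the model still picks Planet X over 95% of the time: the extra randomness shows up as occasional noise, not as genuine, belief-driven exploration.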

Why Should You Care? (Human-AI Dyads)

This isn't just about a game; it's about how we use AI in real life.

  • The Danger of "Confident Wrongness": If you ask an AI for advice on a medical diagnosis or an investment, and it picks the first option it sees, it might stick to that advice even if new evidence suggests it's wrong.
  • The "Echo Chamber" Effect: Because the AI is so stubborn, if you give it a prompt that accidentally favors one side, it will double down on that side. It won't say, "Hey, maybe I should check the other side."
  • The Trap: Humans tend to trust confident AI. If the AI acts like a stubborn expert who never changes its mind, humans might follow it blindly, leading to bad decisions.

The Takeaway

Large Language Models are not flexible, curious learners. They are efficient but rigid optimizers. They are great at finding a path and sticking to it, but terrible at realizing when they need to change direction.

In short: If you treat an AI like a human partner who can adapt and learn, you might be in for a surprise. It's more like a very confident dog that, once it decides to chase a squirrel, will chase that squirrel until it hits a wall, ignoring all other squirrels.