Rigidity in LLM Bandits with Implications for Human-AI Dyads

This paper demonstrates that large language models exhibit robust decision biases in two-arm bandit tasks: stubborn exploitation and low learning rates that persist across decoding parameters, posing significant challenges for optimal human-AI collaboration.

Haomiaomiao Wang, Tomás E Ward, Lili Zhang

Published Tue, 10 Ma

Here is an explanation of the paper, translated into everyday language with some creative analogies.

The Big Idea: AI is a Stubborn Explorer

Imagine you are playing a video game where you have to choose between two treasure chests, Chest X and Chest Y. You don't know which one has the gold, so you have to try them out to learn.

This paper asks a simple but scary question: If we put a Large Language Model (like the AI you chat with) in this game, will it act like a smart learner, or will it get stuck in a rut?

The researchers found that these AIs are incredibly stubborn. Once they make a guess, they rarely change their mind, even when the evidence suggests they should. They treat a tiny, accidental hint as a giant rule, and they refuse to double-check their work.


The Experiment: The "Space Explorer" Game

The researchers turned three popular AIs (DeepSeek, GPT-4.1, and Gemini) into "space explorers."

  • The Setup: They told the AI, "You are a space explorer. Visit Planet X or Planet Y to find gold coins. You don't know which planet has more gold yet."
  • The Rules: They ran this game 20,000 times (200 simulations × 100 rounds each) under different settings.
    • Scenario A (The Coin Flip): Both planets had an equal chance of having gold. A smart player should switch back and forth to see what happens.
    • Scenario B (The Cheat Code): One planet had gold 75% of the time, and the other only 25%. A smart player should find the good one and stick with it, but occasionally check the other one just in case.
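The two scenarios amount to a simple Bernoulli two-arm bandit. Here is a minimal sketch of Scenario B (the 75%/25% split is from the paper; the function names and the epsilon-greedy "smart player" baseline are my own illustrations, not the paper's code):

```python
import random

def pull(planet, rng, p_gold={"X": 0.75, "Y": 0.25}):
    """One visit: returns 1 gold coin with the planet's probability."""
    return 1 if rng.random() < p_gold[planet] else 0

def run_episode(policy, rounds=100, seed=0):
    """Run one 100-round episode, feeding the policy its own history."""
    rng = random.Random(seed)
    history = []  # list of (planet, reward) pairs
    for _ in range(rounds):
        planet = policy(history)
        history.append((planet, pull(planet, rng)))
    return history

def eps_greedy(history, eps=0.1, rng=random.Random(1)):
    """A 'smart player' baseline: mostly exploit, but explore 10% of the time."""
    if not history or rng.random() < eps:
        return rng.choice(["X", "Y"])
    means = {}
    for arm in ("X", "Y"):
        rewards = [r for a, r in history if a == arm]
        means[arm] = sum(rewards) / len(rewards) if rewards else 0.0
    return max(means, key=means.get)

hist = run_episode(eps_greedy)
total_gold = sum(r for _, r in hist)
```

The point of the baseline: even a very simple policy keeps occasionally checking the other planet, which is exactly the behavior the paper reports the LLMs lack.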

What Happened? (The Results)

1. The "First Impression" Trap (Symmetric Rewards)

In the fair game (where both planets were equal), the AIs didn't act fairly.

  • The Analogy: Imagine you walk into a room with two identical doors. You happen to push the left one first, and it opens. You decide, "Aha! The left door is the magic door!" and you spend the next 99 tries pushing the left door, ignoring the right one completely.
  • The Reality: The AIs picked the first option (Planet X) almost immediately. Even though they got no extra gold for doing so, they stuck to that choice stubbornly. They amplified a tiny, random "nudge" into a rigid rule.

2. The "One-Track Mind" (Asymmetric Rewards)

In the unfair game (where one planet was clearly better), the AIs found the better planet quickly.

  • The Analogy: Imagine you find a vending machine that gives you a candy 75% of the time. A smart human would press that button, but maybe try the other button once in a while just to be sure the machine hasn't changed.
  • The Reality: The AIs found the better planet and never tried the other one again. They became "rigid." They exploited the good option so hard that they missed out on small opportunities to verify their strategy. They were so confident they stopped learning.

The "Secret Sauce" (Why does this happen?)

The researchers used a mathematical model (like a detective looking at footprints) to figure out why the AIs acted this way. They found two main "personality traits" in the AI's brain:

  1. Slow Learner (Low Learning Rate): The AI is slow to update its beliefs. If it thinks "Planet X is good," it takes a lot of evidence to convince it that "Actually, Planet Y might be better."
  2. Over-Confident (High Inverse Temperature): This is the most important part. The AI is too certain. It acts like a robot that has decided, "I am 100% sure," rather than a human who says, "I'm pretty sure, but let me check."

The Analogy: Imagine a driver who sees a green light. A normal driver goes. A "rigid" driver sees the green light, decides "Green means Go," and then drives through the intersection at 100 mph even if a red light appears 2 seconds later, because they are too locked into their initial decision to brake.
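The two "personality traits" map onto the two parameters of the standard computational model used for this kind of analysis: a Q-learning update with learning rate alpha and a softmax choice rule with inverse temperature beta. A minimal sketch (the specific parameter values are illustrative only, not the paper's fitted estimates):

```python
import math
import random

def simulate_q_learner(alpha, beta, p_gold=(0.75, 0.25), rounds=100, seed=0):
    """Q-learning with softmax choice on a two-arm bandit.

    alpha: learning rate -- how fast beliefs update after each reward.
    beta:  inverse temperature -- how deterministically the agent
           exploits its current beliefs (high beta = rigid).
    """
    rng = random.Random(seed)
    q = [0.0, 0.0]  # value estimates for the two planets
    choices = []
    for _ in range(rounds):
        # Softmax choice rule: P(arm 0) = 1 / (1 + exp(-beta * (q0 - q1)))
        p0 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
        arm = 0 if rng.random() < p0 else 1
        reward = 1 if rng.random() < p_gold[arm] else 0
        # Delta-rule update: nudge the estimate toward the observed reward.
        q[arm] += alpha * (reward - q[arm])
        choices.append(arm)
    return choices

# Illustrative settings: a "rigid" agent (slow learner, over-confident) ...
rigid = simulate_q_learner(alpha=0.05, beta=20.0)
# ... versus a more exploratory one.
flexible = simulate_q_learner(alpha=0.3, beta=3.0)
```

With a high beta, even a tiny gap between the two value estimates gets amplified into a near-certain choice, which is the mathematical form of the "stubbornness" described above.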

Does Changing the Settings Help?

The researchers tried to "fix" the AI by changing its settings (like turning up the "temperature" to make it more random or creative).

  • The Result: It didn't really work. Turning up the "creativity" knob just made the AI make more random mistakes (like typing the wrong letter), but it didn't make the AI smarter or more willing to explore. The underlying stubbornness remained.
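A quick numerical sketch of why the temperature knob can't undo the rigidity: decoding temperature divides the model's logits before the softmax, so when the internal preference is strong, even doubling the temperature barely flattens the choice distribution (the logit values here are illustrative, not taken from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a decoding-style temperature knob."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# A strong internal preference for "Planet X" over "Planet Y".
logits = [6.0, 0.0]

p_low = softmax(logits, temperature=0.5)[0]   # ~0.99999
p_mid = softmax(logits, temperature=1.0)[0]   # ~0.998
p_high = softmax(logits, temperature=2.0)[0]  # ~0.953
```

Even at temperature 2.0 the model still picks Planet X over 95% of the time: the extra randomness shows up as occasional noise, not as genuine, belief-driven exploration.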

Why Should You Care? (Human-AI Dyads)

This isn't just about a game; it's about how we use AI in real life.

  • The Danger of "Confident Wrongness": If you ask an AI for advice on a medical diagnosis or an investment, and it picks the first option it sees, it might stick to that advice even if new evidence suggests it's wrong.
  • The "Echo Chamber" Effect: Because the AI is so stubborn, if you give it a prompt that accidentally favors one side, it will double down on that side. It won't say, "Hey, maybe I should check the other side."
  • The Trap: Humans tend to trust confident AI. If the AI acts like a stubborn expert who never changes its mind, humans might follow it blindly, leading to bad decisions.

The Takeaway

Large Language Models are not flexible, curious learners. They are efficient but rigid optimizers. They are great at finding a path and sticking to it, but terrible at realizing when they need to change direction.

In short: If you treat an AI like a human partner who can adapt and learn, you might be in for a surprise. It's more like a very confident dog that, once it decides to chase a squirrel, will chase that squirrel until it hits a wall, ignoring all other squirrels.