Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

Imagine you are playing a game of Battleship, but with a twist. You have a partner who can see the entire ocean map, but you can only see a tiny, foggy patch of water around your ship. Your goal is to find and sink your partner's hidden ships.

To do this, you have two choices every turn:

Shoot: Guess where a ship is and fire a cannon.
Ask: Ask your partner a "Yes" or "No" question to get a clue (e.g., "Is there a ship in the top-left corner?").

The problem? Most current AI models are terrible at this game. They either shoot wildly without thinking, or they ask silly questions that don't help them find the ships. They act like a person who "shoots first and asks questions later," often missing the target.

This paper, titled "Shoot First, Ask Questions Later? Building Rational Agents That Explore and Act Like People," introduces a new way to teach AI to play this game (and similar information-seeking tasks) much smarter.

Here is the breakdown of their discovery, using simple analogies:

1. The Problem: The "Guessing Game" AI

The researchers set up a digital version of Battleship where an AI (the Captain) has to talk to another AI (the Spotter) who sees the whole board.

The Captain's Job: Decide whether to ask a question or take a shot.
The Spotter's Job: Answer "Yes" or "No" accurately based on what they see.

They found that even smart AI models were struggling. They asked redundant questions (like asking "Is there a ship?" when they already knew the answer) or made shots that were pure guesses. They weren't acting like "rational" agents who use logic to save resources.

2. The Solution: The "Detective's Toolkit"

The authors realized that to be good at this, an AI needs to act like a detective or a scientist running an experiment. They borrowed a concept from statistics called Bayesian Experimental Design.

Think of it like this:

The Old Way: The AI just picks a question that sounds interesting.
The New Way (The "Bayesian" Way): The AI runs a mental simulation. It asks itself: "If I ask this question, how much will it narrow down the list of possibilities?"

They gave the AI three specific tools (strategies) to use:

A. The "Best Question" Filter (Bayes-Q)

Imagine someone has picked one card from a deck, and you have to work out which card it is using yes/no questions.

Weak Question: "Is it the Ace of Spades?" (Too specific: the answer is almost always "No," which rules out just one card.)
Good Question: "Is the card red?" (Either answer rules out half the deck.)
The AI's New Strategy: The AI generates 100 possible questions, simulates the answers to each, and picks the one expected to narrow down the "search space" the most. It's like sweeping a metal detector to rule out large stretches of beach, rather than digging randomly.
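
The card analogy can be made concrete by measuring, in bits, how much each kind of question is expected to reveal. The sketch below is illustrative only: it assumes the hidden card is equally likely to be any of 52 cards and that answers are always truthful, and the names `binary_entropy` and `DECK_SIZE` are made up for this example.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

DECK_SIZE = 52  # assumption: every card is equally likely to be the hidden one

# With a truthful answerer, the expected information gain of a yes/no
# question equals the entropy of its answer.
eig_is_red = binary_entropy(26 / DECK_SIZE)           # half the deck is red
eig_is_ace_of_spades = binary_entropy(1 / DECK_SIZE)  # a single card

print(f"Is it red?               {eig_is_red:.3f} bits")
print(f"Is it the Ace of Spades? {eig_is_ace_of_spades:.3f} bits")
```

The half-splitting question yields a full bit, while the single-card question yields roughly 0.14 bits, which is why a question that divides the remaining possibilities evenly is the more informative one.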

B. The "Best Shot" Calculator (Bayes-M)

When it's time to shoot, the AI doesn't just guess. It looks at all the possible places a ship could be based on previous clues and calculates the exact probability of a hit. It's like a sniper who calculates wind speed, distance, and target movement before pulling the trigger.
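
As a rough illustration of that calculation, suppose the agent keeps a weighted set of sampled boards that are still consistent with the clues so far. The array layout and the function names `hit_probabilities` and `best_shot` are assumptions for this sketch, not the paper's code.

```python
import numpy as np

def hit_probabilities(boards: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted fraction of sampled boards that have a ship on each tile.

    boards:  (n_samples, rows, cols) array of 0/1 ship masks (hypothetical format)
    weights: (n_samples,) non-negative weights, one per sampled board
    """
    weights = weights / weights.sum()
    return np.tensordot(weights, boards, axes=1)

def best_shot(boards, weights, already_shot):
    """Return (row, col) of the untried tile most likely to hold a ship."""
    probs = hit_probabilities(boards, weights)
    probs = np.where(already_shot, -1.0, probs)  # never re-shoot a tile
    row, col = np.unravel_index(np.argmax(probs), probs.shape)
    return int(row), int(col)

# Three candidate 2x2 boards with unequal weights.
boards = np.array([
    [[1, 0], [0, 0]],
    [[1, 0], [0, 1]],
    [[0, 0], [0, 1]],
])
weights = np.array([0.5, 0.3, 0.2])
no_shots_yet = np.zeros((2, 2), dtype=bool)

first = best_shot(boards, weights, no_shots_yet)

after_first = no_shots_yet.copy()
after_first[first] = True
second = best_shot(boards, weights, after_first)
```

Here tile (0, 0) carries a ship in boards worth 0.8 of the total weight, so it is chosen first; once it is marked as tried, the next choice is (1, 1) at 0.5.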

C. The "Timing" Coach (Bayes-D)

This is the most human-like part. The AI learns when to ask and when to shoot.

Weak AI: Asks all 15 allowed questions at the very start, then shoots blindly. (Like reading the whole instruction manual before turning on the machine).
Smart AI: Asks a few questions, takes a shot, sees the result, asks another question, and takes another shot. It balances gathering info with taking action, just like a human expert would.

3. The Results: Superhuman Performance

The results were surprising and impressive:

Weak AI becomes a Grandmaster: They took a small, relatively "dumb" AI model (Llama-4-Scout) and gave it this "Detective's Toolkit." Suddenly, it didn't just play well; it beat human players 82% of the time and even beat a frontier model (GPT-5) 67% of the time.
Cost Efficiency: The small AI did this at about 1% of the cost of the giant AI. It's like teaching a smart kid to solve a math problem using a clever trick, rather than hiring a team of expensive professors to do the math for them.
Accuracy: For the "Spotter" (the one answering), using code to generate answers made them nearly perfect (94% accuracy), whereas just talking made them make mistakes.

4. Why This Matters

This isn't just about a board game. The authors tested this on another game called Guess Who? (where you guess a person's identity by asking yes/no questions) and saw similarly large gains.

The Big Picture:
In the real world, AI is being used for things like medical diagnosis (asking the right questions to a patient to find a disease) or scientific discovery (designing experiments to find new drugs).

Currently, AI often asks the wrong questions or wastes resources.
This paper shows that if we teach AI to think probabilistically—to simulate outcomes and choose the path that gives the most information with the least effort—we can build agents that are not just "chatbots," but rational partners that can solve complex problems efficiently.

Summary Analogy

Imagine you are looking for a lost key in a messy room.

Normal AI: Starts picking up random objects and checking them, or asks, "Is the key under the sofa?" without checking if the sofa is even in the room.
This New "Rational" AI: First, it looks at the room and thinks, "The key is most likely near the door." It checks there first. If it's not there, it asks, "Did I leave the key in the kitchen?" based on a logical deduction of where it could be. It doesn't waste time checking the ceiling fan.

By giving AI this "logical brain," the researchers turned a clumsy guesser into a master strategist.

Here is a detailed technical summary of the paper "Shoot First, Ask Questions Later? Building Rational Agents That Explore and Act Like People" (ICLR 2026).

1. Problem Statement

The paper addresses a critical gap in the capabilities of Large Language Models (LLMs) when transitioning from passive chat assistants to active, strategic agents. While LLMs excel at answering queries, they often struggle to:

Strategically seek information: Formulating hypotheses and asking targeted questions to reduce uncertainty in combinatorially vast spaces.
Balance exploration vs. exploitation: Deciding when to gather information (ask questions) versus when to act (take a shot/guess) under resource constraints.
Ground reasoning: Providing accurate answers based on complex, context-dependent game states.

The authors posit that current LLMs, even frontier models, do not inherently behave as rational agents in high-stakes, limited-resource environments. They lack the ability to perform Bayesian Experimental Design (BED) effectively without explicit guidance.

2. Methodology

A. The Environment: Collaborative Battleship

The authors introduce Collaborative Battleship, a two-player dialogue and decision-making task adapted from cognitive science literature.

Roles:
- Captain: Has partial visibility of an 8x8 board. Must decide whether to ask a question (exploration) or shoot a tile (exploitation) to sink hidden ships. Limited to 15 questions and 40 shots.
- Spotter: Has full visibility of the board but is restricted to answering Yes/No only.
Dataset (BATTLESHIPQA): The authors collected 126 full human-human game trajectories (N=42 participants), creating a multimodal dataset with 931 gold-labeled questions. This dataset is split into:
- SpotterQA: Evaluates grounded answering (Yes/No) based on board state and dialogue history.
- CaptainQA: Evaluates full strategic gameplay (question selection, move selection, and decision-making).

B. Formal Framework: Bayesian Experimental Design (BED)

The core theoretical contribution is casting the agent's decision process as a Bayesian inference problem.

Belief State: The agent maintains a posterior distribution $\pi_t(s)$ over possible board configurations $s$, updated via Sequential Monte Carlo (SMC) particle filtering as new observations (answers) arrive.
Noise Model: The Spotter is modeled as a Binary Symmetric Channel (BSC) with error rate $\epsilon$, acknowledging that agents (human or AI) may make mistakes.
Strategies: The authors propose three specific inference-time strategies:
1. $Q_{Bayes}$ (Question Selection): Selects questions that maximize Expected Information Gain (EIG). It samples candidate questions, simulates their answers via the belief state, and chooses the one that most reduces entropy.
2. $M_{Bayes}$ (Move Selection): Selects the tile with the highest probability of containing a ship under the current belief distribution (MAP action).
3. $D_{Bayes}$ (Decision Making): Uses a one-step lookahead to decide between asking a question or taking a shot. It compares the current hit probability against the expected hit probability after receiving the answer to a potential question, discounted by a factor $\gamma$.
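
The pieces above can be sketched over a weighted particle set as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array layout, and the exact form of the ask-versus-shoot comparison (`gamma` times the expected best hit probability after the answer, compared against the best hit probability now) are all hypothetical. The EIG expression itself is the standard mutual information of a binary symmetric channel, $H_b(\epsilon + (1 - 2\epsilon)p) - H_b(\epsilon)$, where $p$ is the belief that the true answer is "Yes".

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; clipped to stay finite."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def prob_hear_yes(true_answers, weights, eps):
    """Probability of hearing 'Yes' through a BSC with error rate eps."""
    p_true_yes = float(np.sum(weights * true_answers))
    return eps + (1 - 2 * eps) * p_true_yes

def expected_information_gain(true_answers, weights, eps):
    """EIG in bits of a yes/no question, given each particle's true answer."""
    p_yes = prob_hear_yes(true_answers, weights, eps)
    return float(binary_entropy(p_yes) - binary_entropy(eps))

def reweight(true_answers, weights, heard_yes, eps):
    """Bayes update of normalized particle weights after a noisy answer."""
    likelihood = np.where(true_answers == heard_yes, 1 - eps, eps)
    posterior = weights * likelihood
    return posterior / posterior.sum()

def should_ask(boards, weights, untried, true_answers, eps, gamma):
    """One-step lookahead (assumed form): ask when the discounted expected
    best hit probability after the answer beats the best one available now."""
    def best_hit(w):
        return float(np.tensordot(w, boards, axes=1)[untried].max())

    p_yes = prob_hear_yes(true_answers, weights, eps)
    after = 0.0
    for heard_yes, p in ((True, p_yes), (False, 1 - p_yes)):
        if p > 0:  # skip answers that cannot occur under the current belief
            after += p * best_hit(reweight(true_answers, weights, heard_yes, eps))
    return gamma * after > best_hit(weights)

# Two equally likely 1x2 boards: the ship is in column 0 or in column 1.
boards = np.array([[[1, 0]], [[0, 1]]])
weights = np.array([0.5, 0.5])
untried = np.ones((1, 2), dtype=bool)
ship_in_col0 = np.array([True, False])  # true answer on each particle

eig_clean = expected_information_gain(ship_in_col0, weights, eps=0.0)
eig_coinflip = expected_information_gain(ship_in_col0, weights, eps=0.5)
ask_clean = should_ask(boards, weights, untried, ship_in_col0, eps=0.0, gamma=0.9)
ask_coinflip = should_ask(boards, weights, untried, ship_in_col0, eps=0.5, gamma=0.9)
```

With a perfectly reliable Spotter the question is worth about one bit and asking is preferred; with a Spotter that answers at chance ($\epsilon = 0.5$) the question carries no information, and the discount makes shooting the better choice.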

C. Implementation Techniques

Code Generation for Grounding: For the Spotter role, the paper introduces a "Code" strategy where the LLM generates Python code to compute the answer based on the board state, rather than answering directly. This significantly improves grounding and accuracy.
Inference Scaling: The Bayesian strategies involve sampling multiple candidate questions or board states to approximate the optimal decision, effectively scaling up the model's reasoning capabilities at inference time.
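
To give a feel for the "Code" strategy, the sketch below shows what a generated program and its execution might look like. Everything here is hypothetical: the board encoding, the example question, the `answer` function convention, and the `run_spotter_code` helper are assumptions for illustration, and the board is 4x4 rather than 8x8 for brevity.

```python
import numpy as np

# Hypothetical board encoding: 0 = water, positive integers = ship IDs.
board = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 2],
])

# Stand-in for the text an LLM might return for the question
# "Is any part of a ship in the second row?"
generated_code = """
def answer(board):
    return bool((board[1] > 0).any())
"""

def run_spotter_code(code: str, board: np.ndarray) -> str:
    """Execute generated code in a scratch namespace and map the result to Yes/No."""
    namespace = {}
    exec(code, namespace)  # a real system should sandbox untrusted code
    return "Yes" if namespace["answer"](board) else "No"

spotter_reply = run_spotter_code(generated_code, board)
print(spotter_reply)
```

The appeal of this design is that the model only has to translate the question into a program; the Yes/No answer is then computed from the board itself rather than read off by the model.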

3. Key Contributions

BATTLESHIPQA Dataset: A novel, high-quality dataset capturing rich pragmatic phenomena in grounded dialogue, including discourse dependence, state dependence, vagueness, and ambiguity.
Rational Agent Framework: A modular framework combining LLMs with Bayesian inference (SMC, EIG maximization) to create agents that explicitly reason about uncertainty and resource constraints.
Inference-Time Strategies: The demonstration that "weak" LLMs can achieve superhuman performance when augmented with these specific Bayesian strategies, without requiring retraining.
Generalizability: The framework is successfully applied to a second domain, Guess Who?, demonstrating its applicability beyond grid-based spatial reasoning to object-relational semantic spaces.

4. Key Results

A. SpotterQA (Answering)

Code Generation: Using Python code generation to answer questions improved accuracy by 14.7% over direct answering and Chain-of-Thought (CoT) baselines across 15 models.
Performance Gap: While frontier models (e.g., GPT-5) matched human performance (~92.5% accuracy), weaker models (e.g., Llama-4-Scout) struggled significantly on "complex" (context-dependent) questions. Code generation helped close this gap but did not fully eliminate it for complex queries.

B. CaptainQA (Strategic Play)

Superhuman Performance: The combination of all three Bayesian strategies ($Q_{Bayes} + M_{Bayes} + D_{Bayes}$) allowed weaker models to outperform both humans and frontier models:
- Llama-4-Scout: Win rate against humans increased from 8% to 82%. Win rate against GPT-5 increased from 0% to 67%.
- GPT-4o: Win rate against GPT-5 reached 67%.
Efficiency: These improvements were achieved at a fraction of the cost. Llama-4-Scout with Bayesian strategies cost ~1% of GPT-5's inference cost while outperforming it.
Question Quality: The Bayesian question selection ($Q_{Bayes}$) raised the Expected Information Gain (EIG) of questions to 94.2% of the theoretical noise ceiling and virtually eliminated redundant questions (EIG=0).
Human Comparison: Humans naturally balance exploration and exploitation but are not Bayes-optimal. They ask fewer questions overall but rely on high-quality intuition. The Bayesian agents mimic this efficiency while gathering information in a near-optimal way.

C. Generalization (Guess Who?)

The framework improved success rates in the "Guess Who?" game significantly:
- Llama-4-Scout: 30.0% $\to$ 72.4%.
- GPT-4o: 61.7% $\to$ 90.0%.

5. Significance and Implications

Rationality vs. Capability: The paper demonstrates that the primary limitation of current LLMs in complex decision-making is not necessarily a lack of "intelligence" or knowledge, but a lack of structured reasoning processes. By injecting rational, Bayesian inference loops, even smaller models can outperform larger, unguided models.
Cost-Effective AI: The findings suggest a path toward building highly capable, rational agents using smaller, cheaper models (like Llama-4-Scout) augmented with inference-time computation, rather than relying solely on massive, expensive frontier models.
Human-AI Collaboration: The study highlights that while humans are not perfect Bayesians, they are remarkably good at pragmatic reasoning. The proposed agents bridge the gap between human-like adaptability and machine-like optimality, making them suitable for real-world applications like scientific discovery and medical diagnosis where strategic information seeking is crucial.
Resource Rationality: The work aligns with the "resource rational" view of cognition, showing that agents can be designed to maximize utility given computational constraints, effectively mimicking human-like trade-offs between effort and accuracy.

In conclusion, the paper argues that "shooting first" (acting without sufficient information) is a failure of current LLMs, but by equipping them with "Bayesian eyes" (inference-time strategies), we can build agents that ask the right questions, act strategically, and collaborate effectively with humans.