Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People
Gabriel Grand, Valerio Pepe, Joshua B. Tenenbaum, Jacob Andreas
We introduce a collaborative Battleship task to evaluate information-seeking in humans and agents; insights from Bayesian Experimental Design (BED) yield inference-time strategies for building resource-rational agents in discovery settings.
Abstract
Many emerging applications of AI—from scientific discovery to medical diagnosis—require agents to seek information strategically: forming hypotheses, asking targeted questions, and making decisions under uncertainty. In high-stakes settings with limited resources, do language models (LMs) behave like rational agents? Drawing on insights from human cognition, we develop methods to evaluate and enhance agentic information-seeking. First, we introduce a decision-oriented dialogue task called Collaborative Battleship, in which a Captain must balance exploration (asking questions) and action (taking shots), while a Spotter must supply accurate, contextually-grounded answers. Compared to human players (N=42), we find that many LM agents struggle to ask informative questions, produce accurate answers, and identify high-utility actions. To address these gaps, we develop novel Monte Carlo inference strategies for LMs inspired by Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303–0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% → 82% win rate) and frontier models (0% → 67% win rate vs. GPT-5) at ≈1% of GPT-5's cost. We replicate these findings on Guess Who?, where our methods significantly boost accuracy (+28.3–42.4 p.p.), demonstrating their general applicability for building information-seeking agents.
Develops methods for LMs to ask informative questions and make decisions under uncertainty using Bayesian Experimental Design.
- Introduces Collaborative Battleship, a decision-oriented dialogue task that instantiates the core components of Bayesian Experimental Design for evaluating agentic information-seeking
- Monte Carlo inference strategies inspired by BED boost Spotter accuracy by up to 14.7% absolute and raise Captain expected information gain to 94.2% of the achievable noise ceiling
- Demonstrates that weaker LMs equipped with these methods can outperform both humans and frontier models as resource-rational agents, at roughly 1% of GPT-5's cost
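The expected information gain (EIG) figures above can be made concrete with a minimal Monte Carlo estimator. This is a sketch, not the paper's implementation: it assumes a deterministic, noiseless answer function and equally weighted hypothesis samples, in which case the EIG of a question equals the entropy of the answer distribution the question induces over the samples.

```python
import math
from collections import Counter

def expected_info_gain(hypotheses, answer_fn):
    """Monte Carlo EIG estimate under two simplifying assumptions:
    answers are deterministic and noiseless, and hypothesis samples are
    equally weighted. Then EIG(question) = entropy of the answer
    distribution induced by the samples."""
    counts = Counter(answer_fn(h) for h in hypotheses)
    n = len(hypotheses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy example: 4 equally likely hidden states; a yes/no question
# that splits them 2/2 yields exactly 1 bit of expected information.
hypotheses = ["A", "B", "C", "D"]
question = lambda h: h in ("A", "B")
print(expected_info_gain(hypotheses, question))  # → 1.0
```

A question whose answer is the same for every sampled hypothesis has zero EIG, which is why uninformative questions are wasted turns for the Captain.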
- Bayesian Experimental Design
- Monte Carlo inference
- Language models
- In-context learning
- Collaborative Battleship task
- Guess Who? task
- Pragmatic behaviors are not explicitly modeled; incorporating the Rational Speech Acts framework could improve agent sophistication (from the paper)
- A fixed epsilon does not account for differences in reliability across individual information sources (from the paper)
- The approach relies on efficient sampling from a generative world model; more general settings may require learning the model via code synthesis or diffusion (from the paper)
- Building agents that collaborate effectively with people is increasingly important, and Collaborative Battleship provides an ideal setting for studying this (from the paper)
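The fixed-epsilon limitation noted above can be illustrated with a soft Bayesian update over hypothesis samples. This sketch assumes a single global error rate `eps` applied uniformly to all answers, regardless of which information source produced them; the limitation is precisely that a per-source reliability would be more faithful.

```python
def posterior_weights(hypotheses, answer_fn, observed, eps=0.1):
    """Reweight equally likely hypothesis samples after observing an
    answer, assuming every answer is correct with probability 1 - eps
    and flipped with probability eps (one fixed eps for all sources)."""
    w = [(1 - eps) if answer_fn(h) == observed else eps for h in hypotheses]
    z = sum(w)
    return [x / z for x in w]

# Toy example: observing "yes" to a question true of A and B keeps
# C and D alive with small weight, rather than ruling them out.
weights = posterior_weights(["A", "B", "C", "D"],
                            lambda h: h in ("A", "B"),
                            observed=True, eps=0.1)
print(weights)
```

With `eps=0`, the update collapses to hard filtering of inconsistent hypotheses; a nonzero `eps` keeps them alive with small weight, which is robust to occasional answer errors but treats reliable and unreliable answerers identically.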
Author keywords
- Bayesian experimental design
- information-seeking
- question asking
- Collaborative Battleship
- expected information gain (EIG)
- explore-exploit tradeoffs
- resource rationality
- probabilistic inference
- Monte Carlo sampling
- symbolic grounding
- code generation
- reasoning
- decision-oriented dialogue
- cognitive modeling
- human behavior
- language model agents
- scientific discovery
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.