ICLR 2026 Orals

LLMs Get Lost In Multi-Turn Conversation

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville

LLMs & Reasoning · Thu, Apr 23 · 3:39 PM–3:49 PM · 203 A/B · Avg rating: 8.00 (6–10)
Author-provided TL;DR

We discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

Abstract

Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
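
To make the simulation setup concrete, here is a minimal sketch of how an underspecified multi-turn conversation can be simulated: a fully specified instruction is split into shards that a simulated user reveals one turn at a time. The shard splitting, the `call_llm` stub, and the example instruction are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative assumptions, not the paper's code) of
# simulating an underspecified multi-turn conversation: a fully specified
# instruction is split into shards, and a simulated user reveals one shard
# per turn, forcing the assistant to respond before the task is complete.

def call_llm(history: list[dict]) -> str:
    """Stub standing in for a real model call on the running history."""
    return f"(assistant reply after {len(history)} messages)"

def simulate_conversation(shards: list[str]) -> list[dict]:
    """Reveal one instruction shard per user turn; return the full exchange."""
    history: list[dict] = []
    for shard in shards:
        history.append({"role": "user", "content": shard})
        history.append({"role": "assistant", "content": call_llm(history)})
    return history

# Example: one instruction revealed across three increasingly specific turns.
# The single-turn counterpart would concatenate all shards into one prompt.
shards = [
    "Write a function that deduplicates a list.",
    "It should preserve the original order of the items.",
    "It also has to handle unhashable items like dicts.",
]
for message in simulate_conversation(shards):
    print(f'{message["role"]}: {message["content"]}')
```

Under this framing, the single-turn and multi-turn conditions share the same underlying task, so any score gap can be attributed to underspecification across turns rather than to task difficulty.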

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Study showing that LLMs exhibit a 39% average performance drop in multi-turn conversations and fail to recover from early wrong assumptions.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Large-scale simulation evidence that all major LLMs perform significantly worse in multi-turn vs single-turn settings
  • Decomposition of the performance degradation into a minor loss in aptitude and a significant increase in unreliability (see the sketch after this list)
  • Finding that LLMs make assumptions in early turns, prematurely attempt final solutions, and then over-rely on those attempts
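
A minimal sketch of one way to compute such a decomposition from repeated simulation scores. The percentile choices (90th for best-case aptitude, the 90th–10th gap for unreliability) and the score scale are assumptions for illustration, not necessarily the paper's exact definitions.

```python
# Hypothetical decomposition of repeated-simulation scores for one
# (model, instruction) pair into aptitude and unreliability. Assumptions:
# scores lie in [0, 100]; aptitude is estimated as the 90th-percentile
# score, unreliability as the spread between the 90th and 10th percentiles.
import numpy as np

def decompose(scores: np.ndarray) -> tuple[float, float]:
    """Return (aptitude, unreliability) estimates for one set of runs."""
    p90 = float(np.percentile(scores, 90))
    p10 = float(np.percentile(scores, 10))
    return p90, p90 - p10

# Example: ten simulated conversations of the same instruction. A bimodal
# profile like this yields high aptitude but also high unreliability: the
# model can solve the task, yet when it takes a wrong turn it stays lost.
scores = np.array([92, 88, 35, 90, 41, 87, 89, 38, 91, 86])
aptitude, unreliability = decompose(scores)
print(f"aptitude={aptitude:.1f}, unreliability={unreliability:.1f}")
```

Comparing these two numbers between the single-turn and multi-turn conditions is what lets a small drop in aptitude be separated from a large rise in unreliability.
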
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • simulation experiments
  • multi-turn conversation analysis
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • six generation task benchmarks
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

The authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • A call on LLM builders to prioritize multi-turn reliability, since remediations known to work in simpler settings prove ineffective

Author keywords

  • multi-turn
  • underspecification
  • llm simulation
