ICLR 2026 Orals

LLMs Get Lost In Multi-Turn Conversation

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville

LLMs & Reasoning · Thu, Apr 23 · 3:39 PM–3:49 PM · 203 A/B · Avg rating: 8.00 (6–10)
Author-provided TL;DR

We discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

Abstract

Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
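
To make the simulation setup concrete, here is a minimal sketch of how an underspecified multi-turn conversation can be simulated: a fully specified instruction is split into shards that a simulated user reveals one turn at a time. The shard splitting, the `call_llm` stub, and the example instruction are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative assumptions, not the paper's code) of
# simulating an underspecified multi-turn conversation: a fully specified
# instruction is split into shards, and a simulated user reveals one shard
# per turn, forcing the assistant to respond before the task is complete.

def call_llm(history: list[dict]) -> str:
    """Stub standing in for a real model call on the running history."""
    return f"(assistant reply after {len(history)} messages)"

def simulate_conversation(shards: list[str]) -> list[dict]:
    """Reveal one instruction shard per user turn; return the full exchange."""
    history: list[dict] = []
    for shard in shards:
        history.append({"role": "user", "content": shard})
        history.append({"role": "assistant", "content": call_llm(history)})
    return history

# Example: one instruction revealed across three increasingly specific turns.
# The single-turn counterpart would concatenate all shards into one prompt.
shards = [
    "Write a function that deduplicates a list.",
    "It should preserve the original order of the items.",
    "It also has to handle unhashable items like dicts.",
]
for message in simulate_conversation(shards):
    print(f'{message["role"]}: {message["content"]}')
```

Under this framing, the single-turn and multi-turn conditions share the same underlying task, so any score gap can be attributed to underspecification across turns rather than to task difficulty.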

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Study showing that LLMs exhibit a 39% average performance drop in multi-turn conversations and fail to recover from early wrong assumptions.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Large-scale simulation evidence that all major LLMs perform significantly worse in multi-turn vs single-turn settings
  • Decomposition of the performance degradation into a minor loss in aptitude and a significant increase in unreliability (see the sketch after this list)
  • Finding that LLMs make assumptions in early turns, prematurely attempt final solutions, and then over-rely on those attempts
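
A minimal sketch of one way to compute such a decomposition from repeated simulation scores. The percentile choices (90th for best-case aptitude, the 90th–10th gap for unreliability) and the score scale are assumptions for illustration, not necessarily the paper's exact definitions.

```python
# Hypothetical decomposition of repeated-simulation scores for one
# (model, instruction) pair into aptitude and unreliability. Assumptions:
# scores lie in [0, 100]; aptitude is estimated as the 90th-percentile
# score, unreliability as the spread between the 90th and 10th percentiles.
import numpy as np

def decompose(scores: np.ndarray) -> tuple[float, float]:
    """Return (aptitude, unreliability) estimates for one set of runs."""
    p90 = float(np.percentile(scores, 90))
    p10 = float(np.percentile(scores, 10))
    return p90, p90 - p10

# Example: ten simulated conversations of the same instruction. A bimodal
# profile like this yields high aptitude but also high unreliability: the
# model can solve the task, yet when it takes a wrong turn it stays lost.
scores = np.array([92, 88, 35, 90, 41, 87, 89, 38, 91, 86])
aptitude, unreliability = decompose(scores)
print(f"aptitude={aptitude:.1f}, unreliability={unreliability:.1f}")
```

Comparing these two numbers between the single-turn and multi-turn conditions is what lets a small drop in aptitude be separated from a large rise in unreliability.
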
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • simulation experiments
  • multi-turn conversation analysis
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • six generation task benchmarks
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

The authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • A call on LLM builders to prioritize multi-turn reliability, since remediations known to work in simpler settings prove ineffective

Author keywords

  • multi-turn
  • underspecification
  • llm simulation
