ICLR 2026 Orals

BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions

Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, Linheng Han, Edward Alexander, Xintong Zhu, Rui Qin, Ruihan Yu, Yiyao Jin, Feige Zhou, Weihao Zhong, Yun Chen, Hongyu Liu, Chenhao Ma, Fatma Ozcan, Yannis Papakonstantinou, Reynold Cheng

LLMs & Reasoning Fri, Apr 24 · 3:27 PM–3:37 PM · 203 A/B Avg rating: 7.50 (6–8)

Abstract

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short of capturing this complexity: they either treat conversation histories as static context or limit evaluation to narrow, read-only (SELECT-only) operations, and thus fail to reflect the challenges faced by production-grade database assistants. In this work, we introduce BIRD-INTERACT, a benchmark that restores this missing realism through: (1) a **comprehensive interaction environment** that couples each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from execution errors without human supervision; (2) two **evaluation settings** that reflect real-world interaction: a pre-defined conversational protocol (c-Interact) and a more open-ended agentic setting (a-Interact) in which the model autonomously decides when to query the user simulator or explore the database environment; and (3) a **challenging task suite** that covers the full CRUD spectrum for both business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks, requiring LLMs to engage in dynamic interaction. The suite is organized into two sets: a full set (BIRD-INTERACT-FULL) of 600 tasks, unfolding into up to 11,796 dynamic interactions for a comprehensive view of performance, and a lite set (BIRD-INTERACT-LITE) of 300 tasks with simplified databases for detailed behavioral analysis of interactions and fast method development.
Our empirical results highlight the difficulty of BIRD-INTERACT: the most recent flagship model GPT-5 completes only 8.67% of tasks in the c-Interact setting and 17.00% in the a-Interact setting on the full task suite. Further analysis via memory grafting and Interaction Test-time Scaling (ITS) validates the importance of effective interaction for achieving success in dynamic text-to-SQL tasks.
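The a-Interact setting described in the abstract, where the model itself decides when to query the user simulator versus when to commit to SQL, can be sketched as a simple decision loop. All names below are hypothetical illustrations, not the actual BIRD-INTERACT harness or simulator API; the real benchmark drives this loop with an LLM agent, interaction budgets, and executable test-case grading.

```python
# Hypothetical sketch of an a-Interact-style episode: the agent queries a
# function-driven user simulator to resolve ambiguity, then emits SQL.
# Names and the intent schema are illustrative, not the paper's API.

def user_simulator(question: str) -> str:
    """Stand-in for the function-driven user simulator: answers
    clarification requests from a fixed, hidden intent specification."""
    intent = {"table": "orders", "filter": "status = 'shipped'"}
    if "which table" in question.lower():
        return intent["table"]
    if "which filter" in question.lower():
        return intent["filter"]
    return "unclear"

def a_interact_episode(max_turns: int = 4) -> str:
    """Agent loop: ask clarifying questions until the intent is
    resolved (or the turn budget runs out), then produce a query."""
    known: dict[str, str] = {}
    for _ in range(max_turns):
        if "table" not in known:
            known["table"] = user_simulator("Which table should I query?")
        elif "filter" not in known:
            known["filter"] = user_simulator("Which filter applies?")
        else:
            break  # intent fully resolved; stop spending turns
    return f"SELECT * FROM {known['table']} WHERE {known['filter']};"
```

The key contrast with c-Interact is that here the agent, not a fixed protocol, chooses when (and whether) each simulator call happens, trading interaction budget for reduced ambiguity.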

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

BIRD-INTERACT benchmarks LLMs on dynamic multi-turn text-to-SQL tasks with a function-driven user simulator and dual interaction settings.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • BIRD-INTERACT benchmark with 600 tasks and comprehensive interaction environment
  • Function-driven user simulator enabling models to solicit clarifications and recover from errors
  • Two evaluation settings: conversational protocol and agentic planning modes
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Text-to-SQL generation
  • LLM agents
  • Interactive systems
  • Benchmark design
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • BIRD-INTERACT-FULL
  • BIRD-INTERACT-LITE
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Develop a post-trained, human-aligned local user simulator for more reliable response patterns
  • Conduct experiments in a free-mode setting without budget constraints to observe natural interaction strategies

Author keywords

  • Interactive
  • Text-to-SQL
  • LLM
  • Code Generation
