ICLR 2026 Orals

SWINGARENA: Adversarial Programming Arena for Long-context GitHub Issue Solving

Wendong XU, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Zhuoqing Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong

LLMs & Reasoning Fri, Apr 24 · 3:15 PM–3:25 PM · 203 A/B Avg rating: 6.00 (4–8)

Abstract

We present \textsc{SwingArena}, an adversarial evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, \textsc{SwingArena} models the collaborative process of software iteration by pairing LLMs as \textit{submitters}, who generate patches, and \textit{reviewers}, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. \textsc{SwingArena} presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings.
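The submitter/reviewer/CI loop from the abstract can be sketched as a single round of play. This is a minimal illustrative sketch, not SwingArena's API: the function names (`play_round`, `submitter`, `reviewer`, `ci`) and the toy agents are assumptions standing in for LLM calls and a real CI pipeline.

```python
# Minimal sketch of one adversarial round as described in the abstract:
# a submitter LLM proposes a patch, a reviewer LLM proposes a test case,
# and a CI step decides whether the patch passes. All names here are
# illustrative assumptions, not SwingArena's actual interface.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Round:
    patch: str
    test: str
    passed: bool

def play_round(issue: str,
               submitter: Callable[[str], str],
               reviewer: Callable[[str, str], str],
               ci: Callable[[str, str], bool]) -> Round:
    """One submitter-vs-reviewer round on a single GitHub issue."""
    patch = submitter(issue)                     # submitter drafts a fix
    test = reviewer(issue, patch)                # reviewer drafts a test case
    return Round(patch, test, ci(patch, test))   # CI validates patch vs. test

# Toy stand-ins for the LLM agents and CI pipeline:
submitter = lambda issue: f"fix for: {issue}"
reviewer = lambda issue, patch: f"test covering: {issue}"
ci = lambda patch, test: patch.split(": ")[1] == test.split(": ")[1]

r = play_round("off-by-one in pagination", submitter, reviewer, ci)
```

In the paper's setting the roles are filled by competing LLMs and `ci` is a full CI pipeline run, which is what makes the evaluation adversarial rather than static.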

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

SwingArena evaluates LLMs on GitHub issue solving via an adversarial framework that models submitter–reviewer collaboration with retrieval-augmented code generation.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Introduces SwingArena, an adversarial evaluation framework that models collaborative software development with submitter and reviewer agents
  • Proposes a Retrieval-Augmented Code Generation (RACG) module for handling long-context challenges in large codebases across multiple languages
  • Demonstrates that the framework reveals distinct model strengths: GPT-4o excels at aggressive patch generation, while DeepSeek and Gemini prioritize correctness
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Retrieval-augmented code generation
  • Static analysis
  • Dense retrieval
  • Continuous integration pipelines
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • GitHub issues
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • RACG effectiveness depends on performance of constituent parts (BM25 retrieval, syntax-aware chunking, dense reranking) and degrades with poorly structured or domain-specific codebases
  • Fixed-size retrieval window (5 files with 16 chunks) represents practical trade-off but may fail to capture all relevant context in extremely large monorepos
  • Approximately 15% of failure cases attributable to retrieval phase failing to find critical context
  • Computational overhead of simulating full CI pipelines iteratively makes evaluations resource-intensive compared to static benchmarks
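The RACG pipeline the limitations describe (BM25 retrieval over files, chunking, dense reranking, and a fixed window of 5 files with 16 chunks) can be sketched end to end. This is a toy sketch under stated assumptions: the BM25 parameters, the fixed-size chunker (the paper uses syntax-aware chunking), and the bag-of-words cosine "rerank" (the paper uses a dense reranker) are all illustrative stand-ins, not the authors' implementation.

```python
# Illustrative sketch of a retrieve-then-rerank pipeline in the spirit of
# the paper's RACG module: sparse BM25-style file retrieval, chunking,
# then a toy dense-style rerank, with the fixed 5-file / 16-chunk window
# mentioned in the limitations. All details are assumptions for illustration.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with standard Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()
    for t in tokenized:
        for term in set(t):
            df[term] += 1
    scores = []
    for t in tokenized:
        tf = Counter(t)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def chunk(text, size=40):
    """Fixed-size token chunks; a stand-in for syntax-aware chunking."""
    toks = text.split()
    return [" ".join(toks[i:i + size]) for i in range(0, len(toks), size)]

def cosine_rerank(query, chunks):
    """Toy 'dense' rerank: bag-of-words cosine similarity to the query."""
    def vec(s):
        return Counter(s.lower().split())
    q = vec(query)
    def cos(c):
        dot = sum(q[w] * c[w] for w in q)
        norm = (math.sqrt(sum(v * v for v in q.values()))
                * math.sqrt(sum(v * v for v in c.values())))
        return dot / norm if norm else 0.0
    return sorted(chunks, key=lambda ch: cos(vec(ch)), reverse=True)

def retrieve(query, files, top_files=5, top_chunks=16):
    """Mirror the fixed retrieval window: top-5 files, 16 chunks."""
    scored = sorted(zip(bm25_scores(query, files), files), reverse=True)
    chunks = [c for _, f in scored[:top_files] for c in chunk(f)]
    return cosine_rerank(query, chunks)[:top_chunks]
```

The fixed `top_files`/`top_chunks` window makes the stated limitation concrete: in a very large monorepo, relevant context outside the top-5 files is simply never seen by the reranker.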
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Implement iterative retrieval where models dynamically initiate new retrieval requests based on existing context
  • Explore hierarchical retrieval combining coarse-grained file-level with fine-grained code block-level retrieval
  • Investigate hybrid retrieval combining sparse retrieval with structured retrieval based on code graphs

Author keywords

  • Arena
  • Real-World GitHub Issues
  • Adversarial Programming
  • Retrieval-Augmented Generation
  • Continuous Integration
  • Code Benchmark
