SWINGARENA: Adversarial Programming Arena for Long-context GitHub Issue Solving
Wendong XU, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Zhuoqing Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong
Abstract
We present SwingArena, an adversarial evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, with support for multiple programming languages (C++, Python, Rust, and Go). This allows the framework to scale across diverse tasks and contexts while respecting token limits. Our experiments on over 400 high-quality real-world GitHub issues, selected from a pool of 2,300, show that models such as GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. SwingArena offers a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings.
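The submitter-reviewer interaction described above can be sketched as a single adversarial round. This is a minimal illustration, not the paper's implementation: the names `arena_round`, `submitter`, `reviewer`, and `run_ci` are hypothetical placeholders for the patch-generating model, the test-generating model, and the CI pipeline.

```python
# Hedged sketch of one adversarial round in a SwingArena-style setup.
# All function names here are hypothetical stand-ins, not the paper's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoundResult:
    patch: str       # patch proposed by the submitter model
    test: str        # test case authored by the reviewer model
    ci_passed: bool  # whether the patch passed CI under that test


def arena_round(issue: str,
                submitter: Callable[[str], str],
                reviewer: Callable[[str, str], str],
                run_ci: Callable[[str, str], bool]) -> RoundResult:
    """One round: submitter patches the issue, reviewer writes a test
    against the patch, and the CI pipeline verifies the pair."""
    patch = submitter(issue)
    test = reviewer(issue, patch)
    return RoundResult(patch, test, run_ci(patch, test))
```

In the full framework the two roles are played by competing LLMs and CI is a real pipeline; the sketch only shows the control flow of a single round.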
SwingArena evaluates LLMs on GitHub issue solving through an adversarial framework that models submitter-reviewer collaboration and uses retrieval-augmented code generation.
- Introduces SwingArena, an adversarial evaluation framework that models collaborative software development with submitter and reviewer agents
- Proposes a Retrieval-Augmented Code Generation (RACG) module for handling long-context challenges in large codebases across multiple languages
- Demonstrates that the framework reveals distinct model strengths: GPT-4o excels at aggressive patch generation, while DeepSeek and Gemini prioritize correctness
- Retrieval-augmented code generation
- Static analysis
- Dense retrieval
- Continuous integration pipelines
- GitHub issues
Limitations (from the paper)
- RACG effectiveness depends on the performance of its constituent parts (BM25 retrieval, syntax-aware chunking, dense reranking) and degrades on poorly structured or domain-specific codebases
- The fixed-size retrieval window (5 files with 16 chunks) is a practical trade-off but may fail to capture all relevant context in extremely large monorepos
- Approximately 15% of failure cases are attributable to the retrieval phase failing to find critical context
- The computational overhead of iteratively simulating full CI pipelines makes evaluations resource-intensive compared to static benchmarks
Future directions (from the paper)
- Implement iterative retrieval, where models dynamically initiate new retrieval requests based on existing context
- Explore hierarchical retrieval combining coarse-grained file-level retrieval with fine-grained code-block-level retrieval
- Investigate hybrid retrieval combining sparse retrieval with structured retrieval based on code graphs
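The sparse stage of a retrieval pipeline like the one discussed above (BM25 over code chunks) can be sketched as follows. This is a minimal illustration under stated assumptions: the fixed-size `chunk_code` is a naive stand-in for the paper's syntax-aware chunking, the dense reranking stage is omitted, and the function names are hypothetical.

```python
# Minimal sketch of the sparse-retrieval stage of a RACG-style pipeline.
# Okapi BM25 is standard; chunking and corpus layout are hypothetical
# simplifications, not the paper's implementation.
import math
import re
from collections import Counter


def chunk_code(source: str, max_lines: int = 8) -> list[str]:
    """Naive fixed-size chunking (a stand-in for syntax-aware chunking)."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]


def tokenize(text: str) -> list[str]:
    """Split code/text into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_]\w*", text.lower())


def bm25_scores(query: str, chunks: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each chunk against the query with Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def retrieve(query: str, files: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the top_k highest-scoring chunks across all files."""
    chunks = [c for src in files.values() for c in chunk_code(src)]
    ranked = sorted(zip(bm25_scores(query, chunks), chunks), reverse=True)
    return [c for _, c in ranked[:top_k]]
```

In the full pipeline, the chunks returned here would be passed to a dense reranker before being placed in the model's context window.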
Author keywords
- Arena
- Real-World GitHub Issues
- Adversarial Programming
- Retrieval-Augmented Generation
- Continuous Integration
- Code Benchmark
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.