SWINGARENA: Adversarial Programming Arena for Long-context GitHub Issue Solving
Wendong XU, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Zhuoqing Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong
Abstract
We present SwingArena, an adversarial evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, with support for multiple programming languages (C++, Python, Rust, and Go). This allows the framework to scale across diverse tasks and contexts while respecting token limits. Our experiments on over 400 high-quality real-world GitHub issues, selected from a pool of 2,300, show that models such as GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. SwingArena offers a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings.
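The submitter-reviewer interaction described above can be sketched as a single adversarial round. This is a minimal illustration, not the paper's implementation: the names `arena_round`, `submitter`, `reviewer`, and `run_ci` are hypothetical placeholders for the patch-generating model, the test-generating model, and the CI pipeline.

```python
# Hedged sketch of one adversarial round in a SwingArena-style setup.
# All function names here are hypothetical stand-ins, not the paper's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoundResult:
    patch: str       # patch proposed by the submitter model
    test: str        # test case authored by the reviewer model
    ci_passed: bool  # whether the patch passed CI under that test


def arena_round(issue: str,
                submitter: Callable[[str], str],
                reviewer: Callable[[str, str], str],
                run_ci: Callable[[str, str], bool]) -> RoundResult:
    """One round: submitter patches the issue, reviewer writes a test
    against the patch, and the CI pipeline verifies the pair."""
    patch = submitter(issue)
    test = reviewer(issue, patch)
    return RoundResult(patch, test, run_ci(patch, test))
```

In the full framework the two roles are played by competing LLMs and CI is a real pipeline; the sketch only shows the control flow of a single round.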
SwingArena evaluates LLMs on GitHub issue solving through an adversarial framework that models submitter-reviewer collaboration and uses retrieval-augmented code generation.
- Introduces SwingArena, an adversarial evaluation framework that models collaborative software development with submitter and reviewer agents
- Proposes a Retrieval-Augmented Code Generation (RACG) module for handling long-context challenges in large codebases across multiple languages
- Demonstrates that the framework reveals distinct model strengths: GPT-4o excels at aggressive patch generation, while DeepSeek and Gemini prioritize correctness
- Retrieval-augmented code generation
- Static analysis
- Dense retrieval
- Continuous integration pipelines
- GitHub issues
Limitations (from the paper)
- RACG effectiveness depends on the performance of its constituent parts (BM25 retrieval, syntax-aware chunking, dense reranking) and degrades on poorly structured or domain-specific codebases
- The fixed-size retrieval window (5 files with 16 chunks) is a practical trade-off but may fail to capture all relevant context in extremely large monorepos
- Approximately 15% of failure cases are attributable to the retrieval phase failing to find critical context
- The computational overhead of iteratively simulating full CI pipelines makes evaluations resource-intensive compared to static benchmarks
Future directions (from the paper)
- Implement iterative retrieval, where models dynamically initiate new retrieval requests based on existing context
- Explore hierarchical retrieval combining coarse-grained file-level retrieval with fine-grained code-block-level retrieval
- Investigate hybrid retrieval combining sparse retrieval with structured retrieval based on code graphs
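The sparse stage of a retrieval pipeline like the one discussed above (BM25 over code chunks) can be sketched as follows. This is a minimal illustration under stated assumptions: the fixed-size `chunk_code` is a naive stand-in for the paper's syntax-aware chunking, the dense reranking stage is omitted, and the function names are hypothetical.

```python
# Minimal sketch of the sparse-retrieval stage of a RACG-style pipeline.
# Okapi BM25 is standard; chunking and corpus layout are hypothetical
# simplifications, not the paper's implementation.
import math
import re
from collections import Counter


def chunk_code(source: str, max_lines: int = 8) -> list[str]:
    """Naive fixed-size chunking (a stand-in for syntax-aware chunking)."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]


def tokenize(text: str) -> list[str]:
    """Split code/text into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_]\w*", text.lower())


def bm25_scores(query: str, chunks: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each chunk against the query with Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def retrieve(query: str, files: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the top_k highest-scoring chunks across all files."""
    chunks = [c for src in files.values() for c in chunk_code(src)]
    ranked = sorted(zip(bm25_scores(query, chunks), chunks), reverse=True)
    return [c for _, c in ranked[:top_k]]
```

In the full pipeline, the chunks returned here would be passed to a dense reranker before being placed in the model's context window.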
Author keywords
- Arena
- Real-World GitHub Issues
- Adversarial Programming
- Retrieval-Augmented Generation
- Continuous Integration
- Code Benchmark
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.