ICLR 2026 Orals

CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song

Datasets, Benchmarks & Evaluation · Sat, Apr 25 · 3:27 PM–3:37 PM · 204 A/B · Avg rating: 7.00 (6–8)

Abstract

AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short: they rely on small-scale benchmarks and measure only static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations achieve only a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 34 zero-day vulnerabilities and 18 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
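The core task can be made concrete with a minimal sketch. Assuming (this is an assumption, not the paper's stated harness) that a PoC "reproduces" a vulnerability when it crashes the pre-patch build of the target but not the patched build, the success check reduces to comparing exit statuses of the two builds on the agent's input; all names below are hypothetical.

```python
from typing import Callable

def evaluate_poc(
    run_prepatch: Callable[[bytes], int],
    run_postpatch: Callable[[bytes], int],
    poc: bytes,
) -> bool:
    """Hypothetical success criterion: the PoC reproduces the vulnerability
    if the pre-patch build crashes (nonzero exit, e.g. a sanitizer abort)
    while the patched build runs cleanly on the same input."""
    return run_prepatch(poc) != 0 and run_postpatch(poc) == 0

# Simulated targets: the pre-patch build overflows on inputs longer
# than 8 bytes; the patch removes the crash entirely.
prepatch = lambda data: 1 if len(data) > 8 else 0
postpatch = lambda data: 0

print(evaluate_poc(prepatch, postpatch, b"A" * 16))  # True
print(evaluate_poc(prepatch, postpatch, b"short"))   # False
```

In a real setup the two callables would wrap containerized executions of the vulnerable and fixed project revisions; requiring a clean run on the patched build filters out inputs that crash for unrelated reasons.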

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

CyberGym benchmarks AI agents on 1,507 real-world vulnerabilities, discovering 34 zero-day vulnerabilities; even top models achieve only a ~20% success rate on PoC generation.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Large-scale benchmark with 1,507 real-world vulnerabilities across 188 open-source projects
  • Adjustable settings covering a range of vulnerability analysis tasks
  • Discovered 34 zero-day vulnerabilities and 18 historically incomplete patches
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Security benchmarking
  • AI agents
  • Proof-of-concept generation
  • Vulnerability analysis
Datasets used·Auto-generated by claude-haiku-4-5-20251001
  • CyberGym
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Expand beyond C/C++ memory safety issues to logic flaws and cryptographic weaknesses across web and mobile platforms
  • Support a broader range of programming languages
  • Extend to defensive measures such as patching and offensive measures such as exploitation
  • Standardize heterogeneous test systems and reliable vulnerability detection for patch evaluation

Author keywords

  • Cybersecurity
  • AI
  • Agents
