CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song
Abstract
AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 34 zero-day vulnerabilities and 18 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
CyberGym benchmarks AI agents on 1,507 real-world vulnerabilities, discovering 34 zero-days; top models achieve only 22% success on PoC generation.
- Large-scale benchmark with 1,507 real-world vulnerabilities across 188 open-source projects
- Adjustable settings spanning a range of vulnerability analysis tasks
- Discovered 34 zero-day vulnerabilities and 18 historically incomplete patches
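The benchmark's primary task asks an agent to produce a proof-of-concept input that reproduces a vulnerability in the target codebase. A minimal sketch of how such a reproduction check could work is shown below; it assumes a sanitizer-instrumented build of the target, in the style of OSS-Fuzz harnesses. The function name, paths, and check logic are illustrative assumptions, not CyberGym's actual evaluation harness.

```python
# Hypothetical sketch: decide whether an agent-generated PoC input
# triggers the target vulnerability, by running a sanitizer-instrumented
# build of the target on the PoC and looking for a crash report.
# This is NOT CyberGym's real harness; names and checks are illustrative.
import subprocess


def reproduces_vulnerability(target_binary: str, poc_path: str) -> bool:
    """Return True if the PoC input crashes the instrumented target."""
    result = subprocess.run(
        [target_binary, poc_path],
        capture_output=True,
        text=True,
        timeout=60,
    )
    # Sanitizer-instrumented builds (e.g. AddressSanitizer) exit nonzero
    # and print an error report to stderr when the bug is triggered.
    return result.returncode != 0 and "ERROR: AddressSanitizer" in result.stderr
```

A real harness would additionally pin the vulnerable commit, rebuild the project, and verify that the same PoC no longer crashes the patched build, so that success measures reproduction of the specific vulnerability rather than any crash.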
- Security benchmarking
- AI agents
- Proof-of-concept generation
- Vulnerability analysis
- CyberGym
Authors did not state explicit limitations.
Future directions (from the paper)
- Expand beyond C/C++ memory safety issues to logic flaws and cryptographic weaknesses across web and mobile platforms
- Support a broader range of programming languages
- Extend to defensive measures like patching and offensive measures like exploitation
- Standardize heterogeneous test systems and reliable vulnerability detection for patch evaluation
Author keywords
- Cybersecurity
- AI
- Agents
Related orals
On the Wasserstein Geodesic Principal Component Analysis of probability measures
Geodesic PCA for probability distributions using Wasserstein geometry with neural network parametrization for continuous distributions.
TabStruct: Measuring Structural Fidelity of Tabular Data
TabStruct benchmark evaluates tabular data generators on structural fidelity and conventional dimensions using global utility metric without ground-truth causal structures.
Monocular Normal Estimation via Shading Sequence Estimation
RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
TTSDS2 metric robustly correlates with human judgments for TTS evaluation across diverse speech domains maintaining >0.5 Spearman correlation.
World-In-World: World Models in a Closed-Loop World
Introduces closed-loop benchmark evaluating generative world models on embodied task performance rather than visual quality.