CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song
Abstract
AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 34 zero-day vulnerabilities and 18 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
CyberGym benchmarks AI agents on 1,507 real-world vulnerabilities, discovering 34 zero-days; top models achieve only 22% success on PoC generation.
- Large-scale benchmark with 1,507 real-world vulnerabilities across 188 open-source projects
- Adjustable settings spanning a range of vulnerability analysis tasks
- Discovered 34 zero-day vulnerabilities and 18 historically incomplete patches
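The benchmark's primary task asks an agent to produce a proof-of-concept input that reproduces a vulnerability in the target codebase. A minimal sketch of how such a reproduction check could work is shown below; it assumes a sanitizer-instrumented build of the target, in the style of OSS-Fuzz harnesses. The function name, paths, and check logic are illustrative assumptions, not CyberGym's actual evaluation harness.

```python
# Hypothetical sketch: decide whether an agent-generated PoC input
# triggers the target vulnerability, by running a sanitizer-instrumented
# build of the target on the PoC and looking for a crash report.
# This is NOT CyberGym's real harness; names and checks are illustrative.
import subprocess


def reproduces_vulnerability(target_binary: str, poc_path: str) -> bool:
    """Return True if the PoC input crashes the instrumented target."""
    result = subprocess.run(
        [target_binary, poc_path],
        capture_output=True,
        text=True,
        timeout=60,
    )
    # Sanitizer-instrumented builds (e.g. AddressSanitizer) exit nonzero
    # and print an error report to stderr when the bug is triggered.
    return result.returncode != 0 and "ERROR: AddressSanitizer" in result.stderr
```

A real harness would additionally pin the vulnerable commit, rebuild the project, and verify that the same PoC no longer crashes the patched build, so that success measures reproduction of the specific vulnerability rather than any crash.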
- Security benchmarking
- AI agents
- Proof-of-concept generation
- Vulnerability analysis
- CyberGym
Authors did not state explicit limitations.
Future directions (from the paper)
- Expand beyond C/C++ memory safety issues to logic flaws and cryptographic weaknesses across web and mobile platforms
- Support a broader range of programming languages
- Extend to defensive measures like patching and offensive measures like exploitation
- Standardize heterogeneous test systems and reliable vulnerability detection for patch evaluation
Author keywords
- Cybersecurity
- AI
- Agents
Related orals
On the Wasserstein Geodesic Principal Component Analysis of probability measures
Geodesic PCA for probability distributions using Wasserstein geometry with neural network parametrization for continuous distributions.
TabStruct: Measuring Structural Fidelity of Tabular Data
TabStruct benchmark evaluates tabular data generators on structural fidelity and conventional dimensions using global utility metric without ground-truth causal structures.
Monocular Normal Estimation via Shading Sequence Estimation
RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
TTSDS2 metric robustly correlates with human judgments for TTS evaluation across diverse speech domains maintaining >0.5 Spearman correlation.
World-In-World: World Models in a Closed-Loop World
Introduces closed-loop benchmark evaluating generative world models on embodied task performance rather than visual quality.