
UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Evaluation Framework to Evaluate AI Agents on Large-Scale Vulnerabilities Across Massive Codebases

Cybersecurity has become a significant area of interest in artificial intelligence, driven by the increasing reliance on large software systems and the expanding capabilities of AI tools. As threats evolve in complexity, ensuring the security of software systems has become more than just a matter of conventional protections; it now intersects with automated reasoning, vulnerability detection, and code-level comprehension. Modern cybersecurity requires tools and methods that can simulate real-world scenarios, identify hidden flaws, and validate system integrity across diverse software infrastructures. Within this environment, researchers have been developing benchmarks and methods to systematically evaluate AI agents’ ability to understand, detect, and even exploit vulnerabilities, drawing parallels with human security researchers. However, bridging the gap between AI reasoning and real-world cybersecurity complexities remains a key challenge.

Problem with Existing Benchmarks

One pressing issue is the lack of effective ways to evaluate whether AI systems are truly capable of understanding and handling security tasks under realistic conditions. Simplified benchmark tasks often dominate current testing methods, which rarely mirror the messy and layered reality of large-scale software repositories. These environments involve intricate input conditions, deep code paths, and subtle vulnerabilities that demand more than surface-level inspection. Without robust evaluation methods, it’s difficult to determine whether AI agents can be trusted to perform tasks like vulnerability detection or exploit development. More importantly, current benchmarks don’t reflect the scale and nuance of vulnerabilities found in actively maintained, widely used software systems, leaving a critical evaluation gap.

Limitations of Current Tools

Several benchmarks have been used to evaluate cybersecurity capabilities, including Cybench and the NYU CTF Bench. These focus on capture-the-flag-style tasks that offer limited complexity, typically involving small codebases and constrained test environments. Some benchmarks attempt to engage real-world vulnerabilities, but they often do so at a limited scale. Furthermore, many of the tools rely on either synthetic test cases or narrowly scoped challenge problems, which fail to represent the diversity of software inputs, execution paths, and bug types found in actual systems. Even specialized agents created for security analysis have been tested on benchmarks with only tens or a few hundred tasks, far short of the complexity of real-world threat landscapes.

Introducing CyberGym

Researchers at the University of California, Berkeley introduced CyberGym, a large-scale, comprehensive benchmark designed to evaluate AI agents in real-world cybersecurity contexts. CyberGym includes 1,507 distinct benchmark tasks sourced from actual vulnerabilities found and patched across 188 major open-source software projects. These vulnerabilities were originally identified by OSS-Fuzz, Google's continuous fuzzing service for open-source software. To ensure realism, each benchmark instance includes the full pre-patch codebase, an executable build, and a textual description of the vulnerability. Agents must generate a proof-of-concept (PoC) test that reproduces the vulnerability in the unpatched version, and CyberGym scores success by checking whether the PoC triggers the vulnerability in the pre-patch build but not in the post-patch one. The benchmark uniquely emphasizes PoC generation, a task that requires agents to traverse complex code paths and synthesize inputs that satisfy specific security conditions. CyberGym is modular and containerized, enabling easy expansion and reproducibility.
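At its core, the success criterion is a differential check: the agent's PoC must crash the pre-patch build while leaving the post-patch build intact. The snippet below is a minimal sketch of that check, assuming an OSS-Fuzz-style harness binary that takes the PoC file as its only argument and signals a sanitizer-detected crash via a non-zero exit code; the function names and calling convention are illustrative, not CyberGym's actual evaluator.

```python
import subprocess


def poc_reproduces_vulnerability(poc_path: str,
                                 pre_patch_bin: str,
                                 post_patch_bin: str,
                                 timeout_s: int = 60) -> bool:
    """Differential check: success means the PoC crashes the pre-patch
    build while leaving the post-patch build intact."""

    def crashes(binary: str) -> bool:
        try:
            # OSS-Fuzz-style harnesses typically take the input file as
            # argv[1]; sanitizers report a crash via a non-zero exit code.
            result = subprocess.run([binary, poc_path],
                                    capture_output=True,
                                    timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # treat hangs as "no crash" in this sketch
        return result.returncode != 0

    return crashes(pre_patch_bin) and not crashes(post_patch_bin)
```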

CyberGym Evaluation Levels

The evaluation pipeline in CyberGym is built around four difficulty levels, each providing the agent with progressively more information about the vulnerability (level 0 is the hardest, level 3 the easiest). At level 0, the agent receives only the pre-patch codebase, with no hint about the vulnerability. Level 1 adds a natural-language description. Level 2 supplies the ground-truth proof of concept (PoC) and its crash stack trace, while level 3 additionally provides the patch and the post-patch codebase. Each level changes the reasoning the agent must perform: at level 1, for instance, the agent must infer the vulnerability's location and triggering conditions from the textual description and the codebase alone. To ensure benchmark quality, CyberGym applies filters such as checking the informativeness of patch commit messages, validating PoC reproducibility, and removing redundancy by comparing stack traces. The final dataset comprises codebases with a median of 1,117 files and 387,491 lines of code, ranging up to more than 40,000 files and 7 million lines of code. Patch sizes also vary, modifying a median of 1 file and 7 lines but sometimes spanning 40 files and over 3,000 lines. The vulnerabilities cover a range of crash types, with 30.4% related to heap-buffer-overflow READ and 19.0% due to use of uninitialized values.
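To make the level structure concrete, the sketch below models a benchmark task and the inputs exposed at each level. It is an illustrative Python representation only; the field names and the `agent_inputs` helper are assumptions, not CyberGym's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CyberGymTask:
    """Illustrative task record; field names are assumptions."""
    task_id: str
    pre_patch_codebase: str                 # path to the pre-patch repository
    description: Optional[str] = None       # level 1+: natural-language description
    ground_truth_poc: Optional[bytes] = None  # level 2+: ground-truth PoC
    crash_stack_trace: Optional[str] = None   # level 2+: crash stack trace
    patch_diff: Optional[str] = None          # level 3: the patch itself
    post_patch_codebase: Optional[str] = None  # level 3: post-patch repository


def agent_inputs(task: CyberGymTask, level: int) -> dict:
    """Assemble the inputs an agent would see at a given difficulty level."""
    inputs = {"codebase": task.pre_patch_codebase}
    if level >= 1:
        inputs["description"] = task.description
    if level >= 2:
        inputs["poc"] = task.ground_truth_poc
        inputs["stack_trace"] = task.crash_stack_trace
    if level >= 3:
        inputs["patch"] = task.patch_diff
        inputs["post_patch_codebase"] = task.post_patch_codebase
    return inputs
```

A level-0 call exposes only the codebase; a level-3 call also carries the ground-truth PoC, stack trace, patch, and post-patch tree.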

Experimental Results

When tested against this benchmark, existing agents showed limited success. Among the four agent frameworks evaluated (OpenHands, Codex, ENiGMA, and Cybench), the top performer was OpenHands paired with Claude-3.7-Sonnet, which reproduced only 11.9% of target vulnerabilities. Performance dropped sharply for longer PoC inputs: success rates were highest for PoCs under 10 bytes (43.5%) and fell below 8% for lengths over 100 bytes. Open-source models such as DeepSeek-V3 lagged behind, with only a 3.6% success rate, and even models fine-tuned for code reasoning, such as SWE-Gym-32B and R2E-Gym-32B, failed to generalize, scoring under 2%. Richer input at higher levels improved performance: level 3 reached 17.1% success, while level 0 achieved only 3.5%. Analysis also showed that most successful PoC reproductions occurred within 20 to 40 execution steps, while many runs exceeded 90 steps and ultimately failed. Despite these challenges, agents discovered 15 previously unknown zero-day vulnerabilities and two disclosed but unpatched ones across real-world projects, demonstrating their latent capacity for novel discovery.

Key Takeaways

  • Benchmark Volume and Realism: CyberGym contains 1,507 tasks derived from real, patched vulnerabilities across 188 software projects, making it the largest and most realistic benchmark of its kind.
  • Agent Limitations: Even the best-performing agent-model combination reproduced only 11.9% of vulnerabilities, with many combinations scoring under 5%.
  • Difficulty Scaling: Providing additional inputs, such as stack traces or patches, significantly improved performance, with level 3 tasks yielding a 17.1% success rate.
  • Length Sensitivity: Agents struggled with tasks involving long PoCs. PoCs exceeding 100 bytes, which made up 65.7% of the dataset, had the lowest success rates.
  • Discovery Potential: 15 new zero-day vulnerabilities were discovered by agent-generated PoCs, validating their potential use in real-world security analysis.
  • Model Behavior: Most successful exploits were generated early in the task execution, with diminishing returns after 80 steps.
  • Tool Interactions: Agents performed better when allowed to interact with tools (e.g., using ‘awk’, ‘grep’, or installing ‘xxd’) and to adapt PoCs based on runtime feedback; a minimal sketch of such a feedback loop follows below.
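The observation about tool use and runtime feedback suggests a simple iterate-and-refine loop. The sketch below is a hypothetical illustration, not the agents' actual scaffolding: `generate_candidate` stands in for whatever model call proposes the next PoC, and the pre-patch harness is assumed to report crashes through a non-zero exit code with sanitizer output on stderr.

```python
import subprocess


def refine_poc(generate_candidate, pre_patch_bin: str, max_steps: int = 40):
    """Iteratively propose a PoC, run it against the pre-patch harness,
    and feed the runtime output back into the next proposal.
    `generate_candidate(feedback)` is a hypothetical stand-in for the
    agent's model call; it returns candidate PoC bytes."""
    feedback = ""
    for step in range(max_steps):
        poc_bytes = generate_candidate(feedback)
        with open("poc.bin", "wb") as f:
            f.write(poc_bytes)
        try:
            result = subprocess.run([pre_patch_bin, "poc.bin"],
                                    capture_output=True, text=True,
                                    timeout=60)
        except subprocess.TimeoutExpired:
            feedback = "timeout"
            continue
        if result.returncode != 0:
            # Non-zero exit: a sanitizer crash, so the candidate reproduces the bug.
            return poc_bytes, step
        # Otherwise, parse errors or partial progress in stderr guide the next attempt.
        feedback = result.stderr
    return None, max_steps
```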

Conclusion

In conclusion, this study highlights a critical point: evaluating AI in cybersecurity is both challenging and essential for understanding its capabilities and limitations. CyberGym stands out by offering a large-scale, real-world framework for doing so. The researchers addressed the gap with a practical, detailed benchmark that forces agents to reason across entire codebases, generate valid exploits, and adapt through iteration. The results make it clear that while current agents show promise, especially in discovering new bugs, there is still a long road ahead before AI can reliably contribute to cybersecurity at scale.


Check out the Paper, GitHub Page, and Leaderboard. All credit for this research goes to the researchers of this project.
