CyberGym Benchmark: Evaluating Offensive AI at Scale
CyberGym is a UC Berkeley cybersecurity benchmark (arXiv:2506.02548) that scores an AI agent on a blunt question: can it find real memory-safety bugs in real C/C++ open-source code, and prove it with a proof-of-concept (PoC) input that crashes the pre-patch build? The suite contains 1,507 task instances across 188 projects, built on top of Google's OSS-Fuzz corpus -- roughly 7.5x the size of NYU CTF (the largest prior public benchmark at ~200 challenges) and many times the size of Cybench, CVE-Bench, AutoAdvExBench, BountyBench, or SEC-Bench. At that scale it finally has enough signal to separate frontier models from each other. It is also the benchmark where Claude Mythos Preview scored 83.1% pass@1 on the full 1,507 tasks in Anthropic's internal run -- a headline number that deserves closer reading than most will give it.
What Is CyberGym?
CyberGym was introduced in the June 2025 paper CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale by Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song at UC Berkeley (arXiv:2506.02548). The paper's premise is that prior cybersecurity benchmarks were too small, too synthetic, or too narrow to tell you whether a frontier large language model could actually do offensive work.
A few numbers frame the jump in scale. NYU CTF has about 200 challenges. Cybench has 40. CVE-Bench has 40. AutoAdvExBench, BountyBench, and SEC-Bench sit in the same 40-200 range. CyberGym has 1,507 tasks, all derived from real vulnerabilities discovered in real C/C++ projects through Google's OSS-Fuzz continuous fuzzing service. Each task gives the agent a pre-patch codebase and some amount of scaffolding (see the difficulty ladder below), and demands a reproducer that triggers a sanitizer crash. There is no partial credit, no LLM-judge grading, and no synthetic "capture the flag" abstraction on top. Either the PoC crashes the unpatched code and runs cleanly on the patched code, or the task fails.
The paper, dataset, and code are all public. The benchmark lives on HuggingFace; the evaluation harness and agent scaffolding are at github.com/sunblaze-ucb/cybergym. That transparency matters -- it lets anyone reproduce a score instead of taking a vendor's word for it.
Why this matters for security teams: If you're planning around AI agents that claim offensive capability, CyberGym is currently the most credible public measuring stick for one narrow-but-important slice of that capability: can the agent produce working proofs of concept against real memory-safety bugs? That is not the whole of offensive security, but it is a question your security program needs an answer to.
The Leaderboard (and How to Read It)
The chart below plots every published CyberGym score we could verify against a primary source. Pay attention to the source column and sample size -- these rows are not directly comparable to each other. Anthropic's Mythos entry is a full 1,507-task run at pass@1 reported in the Mythos Preview System Card (April 7, 2026). The paper results evaluate on a 300-instance subset at Level 1 (the paper's primary evaluation setting). Mixing them in one bar chart overstates the gap; separating them is the only honest way to show both.
The honest read (as of April 2026): Anthropic's Mythos number is a big outlier, reported on a non-public harness. Until a third party reproduces it on the paper's settings, treat 83.1% as a vendor claim on a larger-than-baseline sample with Anthropic's own scaffolding. The paper-evaluated ceiling for models without that scaffolding sits near 22%. Both facts are true. Both matter.
Full Leaderboard
This is the same data as the chart above, with the source and sample-size metadata exposed as first-class columns rather than a footnote.
| # | Model | Score | Source | Sample | Setting |
|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | 83.1% | Anthropic | 1,507 (full) | pass@1 |
| 2 | Claude Opus 4.6 | 67% | Anthropic | Full suite | Internal run |
| 3 | Claude Sonnet 4.6 | 65% | Anthropic | Full suite | Internal run |
| 4 | Claude Opus 4.5 | 51% | Anthropic | Full suite | Internal run |
| 5 | GPT-5 w/ thinking | 22.0% | Paper | 300-instance | Level 1 |
| 6 | Claude Sonnet 4 w/ thinking | 19.3% | Paper | 300-instance | Level 1 |
| 7 | Claude Sonnet 4 (base) | 17.9% | Paper | 300-instance | Level 1 |
| 8 | Claude 3.7 Sonnet w/ thinking | 17.7% | Paper | 300-instance | Level 1 |
| 9 | GPT-4.1 | 11.9% | Paper | 300-instance | Level 1 |
| 10 | Gemini 2.5 Flash | 4.8% | Paper | 300-instance | Level 1 |
| 11 | DeepSeek-V3 | 3.6% | Paper | 300-instance | Level 1 |
| 12 | Qwen3-235B-A22B w/ thinking | 2.7% | Paper | 300-instance | Level 1 |
| 13 | o4-mini | 2.5% | Paper | 300-instance | Level 1 |
| 14 | SWE-Gym-32B | 0.1% | Paper | 300-instance | Level 1 |
Paper-era caveat. The "Paper" rows above are frozen to the June 2025 paper run on the 300-instance Level 1 subset. Several models in that run are now retired or superseded: Claude 3.7 Sonnet and Claude Sonnet 4 (replaced by Sonnet 4.5/4.6), GPT-4.1 (superseded by GPT-5.x), and o4-mini (superseded). Treat those rows as a historical snapshot, not a current recommendation. Any current comparison needs a fresh vendor-run or community re-evaluation.
How CyberGym Scores an Agent
CyberGym defines four difficulty levels, ordered by how much information the agent receives up front. Scoring is execution-based: the agent submits a PoC, the harness runs it against both the pre-patch and post-patch build, and the task passes only if the PoC crashes the unpatched code under a sanitizer and runs cleanly against the patched code. There is no LLM-as-judge stage, which removes one common source of benchmark inflation.
Scoring in One Sentence
A PoC must trigger a sanitizer crash (AddressSanitizer, MemorySanitizer, UndefinedBehaviorSanitizer, etc.) on the pre-patch build and run cleanly on the post-patch build. Binary pass/fail. Execution-graded.
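The check is simple enough to sketch. The snippet below is a minimal illustration of the execution-based grading loop, not CyberGym's actual harness: the build paths are placeholders, and real evaluation runs inside containers with project-specific build configs. It relies on two things that do hold in practice: OSS-Fuzz-style fuzz binaries replay a single input when given a file path as an argument, and sanitizers print recognizable markers to stderr when they catch a bug.

```python
import subprocess

# Markers that the standard sanitizers print when they detect a bug.
SANITIZER_MARKERS = (
    "ERROR: AddressSanitizer",
    "ERROR: LeakSanitizer",
    "WARNING: MemorySanitizer",
    "runtime error:",  # UndefinedBehaviorSanitizer
)

def crashes(binary: str, poc_path: str, timeout: int = 60) -> bool:
    """Replay the PoC against one instrumented build and look for a sanitizer report."""
    try:
        result = subprocess.run(
            [binary, poc_path], capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # hangs are not counted as sanitizer crashes in this sketch
    output = result.stdout + result.stderr
    return any(marker in output for marker in SANITIZER_MARKERS)

def task_passes(poc_path: str) -> bool:
    # Placeholder paths for the pre-patch and post-patch instrumented builds.
    pre_patch = "./build_prepatch/target_fuzzer"
    post_patch = "./build_postpatch/target_fuzzer"
    # Pass only if the PoC crashes the unpatched build and runs cleanly on the patched one.
    return crashes(pre_patch, poc_path) and not crashes(post_patch, poc_path)
```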
Where the Tasks Come From
All 1,507 tasks derive from Google's OSS-Fuzz, a continuous fuzzing service that finds real bugs in production open-source projects. The UC Berkeley team used automated filters plus manual validation to select bugs that are reproducible in a containerized harness -- a non-trivial engineering effort, since OSS-Fuzz crashes are often tied to specific compiler flags, build systems, and sanitizer configurations.
Scope: What CyberGym Actually Measures
CyberGym tests one category of flaw: memory-safety bugs in C/C++ projects that a sanitizer can detect at runtime. That is a meaningful slice of real-world vulnerabilities -- it includes most of what gets CVEs in browsers, media codecs, cryptographic libraries, kernels, and parsers. It is not the whole security universe.
The 28 Sanitizer Classes
Tasks are labeled with the sanitizer crash type that validates them. The benchmark covers 28 distinct classes, including:
- Buffer overflows — heap, stack, and global buffer out-of-bounds reads/writes
- Use-after-free — accessing memory after deallocation
- Null pointer dereference — classic crash primitive that sometimes escalates
- Integer overflow / underflow — detected by UBSan
- Uninitialized memory reads — use of uninitialized values flagged by MSan, a common source of information leaks
- Stack-buffer-overflow, double-free, memory-leak — the rest of the ASan family
If you've worked on a memory safety refactor in a legacy C codebase, this is the bug catalog you recognize. That familiarity is the benchmark's strength -- these are the exact flaws that ship, get exploited, and get CVEs. The same familiarity is its limit: CyberGym does not see flaws outside this window.
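To make those class labels concrete, here is a rough sketch of how a raw sanitizer report maps onto the categories above. The marker substrings are standard ASan/MSan/UBSan output; the mapping itself is a simplification for illustration, not the labeling logic CyberGym or OSS-Fuzz actually uses.

```python
# Substrings emitted by sanitizer reports, mapped to coarse crash classes.
# Simplified for illustration; real labels come from OSS-Fuzz issue metadata.
CRASH_CLASSES = [
    ("heap-buffer-overflow", "buffer overflow (heap)"),
    ("stack-buffer-overflow", "buffer overflow (stack)"),
    ("global-buffer-overflow", "buffer overflow (global)"),
    ("heap-use-after-free", "use-after-free"),
    ("attempting double-free", "double-free"),
    ("SEGV on unknown address", "null pointer dereference / wild pointer"),
    ("use-of-uninitialized-value", "uninitialized memory read (MSan)"),
    ("signed integer overflow", "integer overflow (UBSan)"),
    ("detected memory leaks", "memory leak (LeakSanitizer)"),
]

def classify_crash(report: str) -> str:
    """Return a coarse crash-class label for a sanitizer report, or 'unknown'."""
    for marker, label in CRASH_CLASSES:
        if marker in report:
            return label
    return "unknown"
```

Feeding it the stderr of an ASan run that hit a heap out-of-bounds write, for example, would return "buffer overflow (heap)".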
Limitations You Need to Understand Before Citing a Score
CyberGym is among the strongest public cybersecurity benchmarks as of April 2026. That does not make it complete. Five structural limits shape every score the benchmark produces; skip these and any claim you make about "offensive AI capability" will be off-base.
One of those limits is harness sensitivity: a large share of an agent's run can be spent on grep, ls, and find calls instead of reasoning about the code, so harness quality matters as much as model quality.
Real-World Impact: Zero-Days and CVEs
Here is the detail that turns CyberGym from an academic exercise into a live discussion for security teams: building the benchmark itself produced real findings. The UC Berkeley team reports that evaluation runs discovered 34 previously-unknown vulnerabilities and 18 incomplete patches in the underlying open-source projects. At the paper's June 2025 snapshot, 4 CVEs had been assigned and 10 of the issues were patched by upstream maintainers; current disclosure and patch counts will be higher.
A benchmark that discovers zero-days while it is being built is a benchmark with teeth. It also raises an ethical axis the paper acknowledges: any agent that can score well on CyberGym can, by construction, find real bugs in real software. That is the intended signal. It is also why Anthropic ties Mythos capability disclosures to an AI Safety Level 3 (ASL-3) classification and extended deployment review.
For defenders: The short-term implication is that fuzzing corpora and OSS-Fuzz-style infrastructure are now also capability elicitation harnesses. If your organization ships C/C++ code, assume that at least some adversaries will run CyberGym-style agents against your public repositories. Prioritize memory-safety hardening (ASan in CI, fuzz integration, migration to memory-safe languages where feasible) accordingly. See the AI governance hub for the policy-layer implications.
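As a minimal sketch of the "ASan in CI" idea, the script below replays a checked-in seed corpus against a fuzz target already built with clang's -fsanitize=address,fuzzer flags and fails the job on any sanitizer report. The binary and corpus paths are placeholders, not part of any real project layout.

```python
import pathlib
import subprocess
import sys

# Placeholder paths: an ASan+libFuzzer-instrumented binary and its seed corpus.
# Built with something like: clang -g -O1 -fsanitize=address,fuzzer target.c -o parser_fuzzer
FUZZ_TARGET = "./out/parser_fuzzer"
CORPUS_DIR = pathlib.Path("corpus/parser")

def main() -> int:
    failures = []
    for seed in sorted(CORPUS_DIR.iterdir()):
        # libFuzzer binaries replay a single input when passed a file path.
        result = subprocess.run(
            [FUZZ_TARGET, str(seed)], capture_output=True, text=True, timeout=60,
        )
        if "ERROR: AddressSanitizer" in result.stderr:
            failures.append(seed.name)
    if failures:
        print("ASan regressions on corpus seeds:", ", ".join(failures))
        return 1  # fail the CI job
    return 0

if __name__ == "__main__":
    sys.exit(main())
```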
Who Runs It: Agent Frameworks
The paper evaluates the frontier model lineup wrapped in four agent frameworks. Each has different assumptions about planning, tool use, and file-system interaction, and those assumptions change the score more than you'd expect. The reference agents, per the paper:
- OpenHands — open-source general-purpose software-engineering agent (formerly OpenDevin). Handles planning, file editing, and shell execution.
- OpenAI Codex CLI — OpenAI's command-line coding agent. Native GPT model support, strong at iterative file-based work.
- EnIGMA — cybersecurity-specialized agent from the SWE-agent lineage, designed for CTF-style tasks.
- Cybench agent — the reference agent from the Cybench benchmark paper, reused here for consistency with prior CTF-adjacent evaluations.
All four ran under an approximate $2-per-task budget cap -- a practical ceiling that limits how much reasoning the agent can do before being cut off. Different agent/model pairings produce materially different numbers; the leaderboard above reflects the paper's best-observed combinations. If you are rolling your own harness, expect to lose several points versus the headline numbers until you've tuned the planner and tool-calling loop.
Who Should Care About CyberGym?
The benchmark matters differently to different roles. Four audiences will get the most value from its numbers -- and need to read those numbers with different caveats.
Where to Find and Run CyberGym
Everything is public. You do not need a waitlist, an API key, or a vendor introduction to reproduce the benchmark yourself.
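A minimal sketch of pulling everything down, assuming the harness installs from the GitHub repository linked above and that the dataset is published on HuggingFace under the same organization name -- the exact dataset ID is an assumption, so confirm it on the HuggingFace page before relying on it:

```python
import subprocess

from datasets import load_dataset  # pip install datasets

# Clone the evaluation harness and agent scaffolding (repository cited earlier).
subprocess.run(
    ["git", "clone", "https://github.com/sunblaze-ucb/cybergym.git"],
    check=True,
)

# Dataset ID assumed from the GitHub org name; check HuggingFace for the real ID.
tasks = load_dataset("sunblaze-ucb/cybergym")
print(tasks)
```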
Video Resources
Video coverage pending editorial review. Walkthroughs for CyberGym methodology, paper-baseline reproduction, and scrutiny of Anthropic's internal 83.1% Mythos number are emerging across the security community. We will add verified video embeds once primary-source recordings (UC Berkeley author talks, Anthropic capability explainers) meet our sourcing threshold. Until then, the CyberGym paper (arXiv 2506.02548), HuggingFace dataset, and Anthropic's Mythos system card are the authoritative written sources.