CyberGym Benchmark: Evaluating Offensive AI at Scale
CyberGym is a UC Berkeley cybersecurity benchmark (arXiv:2506.02548) that scores an AI agent on a blunt question: can it find real memory-safety bugs in real C/C++ open-source code, and prove it with a proof-of-concept (PoC) input that crashes the pre-patch build? The suite contains 1,507 task instances across 188 projects, built on top of Google's OSS-Fuzz corpus -- roughly 7.5x the size of NYU CTF (the largest prior public benchmark at ~200 challenges) and many times the size of Cybench, CVE-Bench, AutoAdvExBench, BountyBench, or SEC-Bench. At that scale it finally has enough signal to separate frontier models from each other. It is also the benchmark where Claude Mythos Preview scored 83.1% pass@1 on the full 1,507 tasks in Anthropic's internal run -- a headline number that deserves closer reading than most will give it.
What Is CyberGym?
CyberGym was introduced in the June 2025 paper CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale by Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song at UC Berkeley (arXiv:2506.02548). The paper's premise is that prior cybersecurity benchmarks were too small, too synthetic, or too narrow to tell you whether a frontier large language model could actually do offensive work.
A few numbers frame the jump in scale. NYU CTF has about 200 challenges. Cybench has 40. CVE-Bench has 40. AutoAdvExBench, BountyBench, and SEC-Bench sit in the same 40-200 range. CyberGym has 1,507 tasks, all derived from real vulnerabilities discovered in real C/C++ projects through Google's OSS-Fuzz continuous fuzzing service. Each task gives the agent a pre-patch codebase and some amount of scaffolding (see the difficulty ladder below), and demands a reproducer that triggers a sanitizer crash. There is no partial credit, no LLM-judge grading, and no synthetic "capture the flag" abstraction on top. Either the PoC crashes the unpatched code and runs cleanly on the patched code, or the task fails.
The paper, dataset, and code are all public. The benchmark lives on HuggingFace; the evaluation harness and agent scaffolding are at github.com/sunblaze-ucb/cybergym. That transparency matters -- it lets anyone reproduce a score instead of taking a vendor's word for it.
Why this matters for security teams: If you're planning around AI agents that claim offensive capability, CyberGym is currently the most credible public measuring stick for one narrow-but-important slice of that capability: can the agent produce working proofs of concept against real memory-safety bugs? That is not the whole of offensive security, but it is a question your security program needs an answer to.
The Leaderboard (and How to Read It)
The chart below plots every published CyberGym score we could verify against a primary source. Pay attention to the source column and sample size -- these rows are not directly comparable to each other. Anthropic's Mythos entry is a full 1,507-task run at pass@1 reported in the Mythos Preview System Card (April 7, 2026). The paper results evaluate on a 300-instance subset at Level 1 (the paper's primary evaluation setting). Mixing them in one bar chart overstates the gap; separating them is the only honest way to show both.
The honest read (as of April 2026): Anthropic's Mythos number is a big outlier, reported on a non-public harness. Until a third party reproduces it on the paper's settings, treat 83.1% as a vendor claim on a larger-than-baseline sample with Anthropic's own scaffolding. The paper-evaluated ceiling for models without that scaffolding sits near 22%. Both facts are true. Both matter.
Full Leaderboard
This is the same data as the chart above, with the source and sample-size metadata exposed as first-class columns rather than a footnote.
| # | Model | Score | Source | Sample | Setting |
|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | 83.1% | Anthropic | 1,507 (full) | pass@1 |
| 2 | Claude Opus 4.6 | 67% | Anthropic | Full suite | Internal run |
| 3 | Claude Sonnet 4.6 | 65% | Anthropic | Full suite | Internal run |
| 4 | Claude Opus 4.5 | 51% | Anthropic | Full suite | Internal run |
| 5 | GPT-5 w/ thinking | 22.0% | Paper | 300-instance | Level 1 |
| 6 | Claude Sonnet 4 w/ thinking | 19.3% | Paper | 300-instance | Level 1 |
| 7 | Claude Sonnet 4 (base) | 17.9% | Paper | 300-instance | Level 1 |
| 8 | Claude 3.7 Sonnet w/ thinking | 17.7% | Paper | 300-instance | Level 1 |
| 9 | GPT-4.1 | 11.9% | Paper | 300-instance | Level 1 |
| 10 | Gemini 2.5 Flash | 4.8% | Paper | 300-instance | Level 1 |
| 11 | DeepSeek-V3 | 3.6% | Paper | 300-instance | Level 1 |
| 12 | Qwen3-235B-A22B w/ thinking | 2.7% | Paper | 300-instance | Level 1 |
| 13 | o4-mini | 2.5% | Paper | 300-instance | Level 1 |
| 14 | SWE-Gym-32B | 0.1% | Paper | 300-instance | Level 1 |
Paper-era caveat. The "Paper" rows above are frozen to the June 2025 paper run on the 300-instance Level 1 subset. Several models in that run are now retired or superseded: Claude 3.7 Sonnet and Claude Sonnet 4 (replaced by Sonnet 4.5/4.6), GPT-4.1 (superseded by GPT-5.x), and o4-mini (superseded). Treat those rows as a historical snapshot, not a current recommendation. Any current comparison needs a fresh vendor-run or community re-evaluation.
How CyberGym Scores an Agent
CyberGym defines four difficulty levels, ordered by how much information the agent receives up front. Scoring is execution-based: the agent submits a PoC, the harness runs it against both the pre-patch and post-patch build, and the task passes only if the PoC crashes the unpatched code under a sanitizer and runs cleanly against the patched code. There is no LLM-as-judge stage, which removes one common source of benchmark inflation.
Scoring in One Sentence
A PoC must trigger a sanitizer crash (AddressSanitizer, MemorySanitizer, UndefinedBehaviorSanitizer, etc.) on the pre-patch build and run cleanly on the post-patch build. Binary pass/fail. Execution-graded.
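The check is simple enough to sketch. The snippet below is a minimal illustration of the execution-based grading loop, not CyberGym's actual harness: the build paths are placeholders, and real evaluation runs inside containers with project-specific build configs. It relies on two things that do hold in practice: OSS-Fuzz-style fuzz binaries replay a single input when given a file path as an argument, and sanitizers print recognizable markers to stderr when they catch a bug.

```python
import subprocess

# Markers that the standard sanitizers print when they detect a bug.
SANITIZER_MARKERS = (
    "ERROR: AddressSanitizer",
    "ERROR: LeakSanitizer",
    "WARNING: MemorySanitizer",
    "runtime error:",  # UndefinedBehaviorSanitizer
)

def crashes(binary: str, poc_path: str, timeout: int = 60) -> bool:
    """Replay the PoC against one instrumented build and look for a sanitizer report."""
    try:
        result = subprocess.run(
            [binary, poc_path], capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # hangs are not counted as sanitizer crashes in this sketch
    output = result.stdout + result.stderr
    return any(marker in output for marker in SANITIZER_MARKERS)

def task_passes(poc_path: str) -> bool:
    # Placeholder paths for the pre-patch and post-patch instrumented builds.
    pre_patch = "./build_prepatch/target_fuzzer"
    post_patch = "./build_postpatch/target_fuzzer"
    # Pass only if the PoC crashes the unpatched build and runs cleanly on the patched one.
    return crashes(pre_patch, poc_path) and not crashes(post_patch, poc_path)
```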
Where the Tasks Come From
All 1,507 tasks derive from Google's OSS-Fuzz, a continuous fuzzing service that finds real bugs in production open-source projects. The UC Berkeley team used automated filters plus manual validation to select bugs that are reproducible in a containerized harness -- a non-trivial engineering effort, since OSS-Fuzz crashes are often tied to specific compiler flags, build systems, and sanitizer configurations.
Scope: What CyberGym Actually Measures
CyberGym tests one category of flaw: memory-safety bugs in C/C++ projects that a sanitizer can detect at runtime. That is a meaningful slice of real-world vulnerabilities -- it includes most of what gets CVEs in browsers, media codecs, cryptographic libraries, kernels, and parsers. It is not the whole security universe.
The 28 Sanitizer Classes
Tasks are labeled with the sanitizer crash type that validates them. The benchmark covers 28 distinct classes, including:
- Buffer overflows — heap, stack, and global buffer out-of-bounds reads/writes
- Use-after-free — accessing memory after deallocation
- Null pointer dereference — classic crash primitive that sometimes escalates
- Integer overflow / underflow — detected by UBSan
- Uninitialized memory reads — use of uninitialized values flagged by MSan, a common source of information leaks
- Stack-buffer-overflow, double-free, memory-leak — the rest of the ASan family
If you've worked on a memory safety refactor in a legacy C codebase, this is the bug catalog you recognize. That familiarity is the benchmark's strength -- these are the exact flaws that ship, get exploited, and get CVEs. The same familiarity is its limit: CyberGym does not see flaws outside this window.
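To make those class labels concrete, here is a rough sketch of how a raw sanitizer report maps onto the categories above. The marker substrings are standard ASan/MSan/UBSan output; the mapping itself is a simplification for illustration, not the labeling logic CyberGym or OSS-Fuzz actually uses.

```python
# Substrings emitted by sanitizer reports, mapped to coarse crash classes.
# Simplified for illustration; real labels come from OSS-Fuzz issue metadata.
CRASH_CLASSES = [
    ("heap-buffer-overflow", "buffer overflow (heap)"),
    ("stack-buffer-overflow", "buffer overflow (stack)"),
    ("global-buffer-overflow", "buffer overflow (global)"),
    ("heap-use-after-free", "use-after-free"),
    ("attempting double-free", "double-free"),
    ("SEGV on unknown address", "null pointer dereference / wild pointer"),
    ("use-of-uninitialized-value", "uninitialized memory read (MSan)"),
    ("signed integer overflow", "integer overflow (UBSan)"),
    ("detected memory leaks", "memory leak (LeakSanitizer)"),
]

def classify_crash(report: str) -> str:
    """Return a coarse crash-class label for a sanitizer report, or 'unknown'."""
    for marker, label in CRASH_CLASSES:
        if marker in report:
            return label
    return "unknown"
```

Feeding it the stderr of an ASan run that hit a heap out-of-bounds write, for example, would return "buffer overflow (heap)".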
Limitations You Need to Understand Before Citing a Score
CyberGym is among the strongest public cybersecurity benchmarks as of April 2026. That does not make it complete. Five structural limits shape every score the benchmark produces; skip these and any claim you make about "offensive AI capability" will be off-base.
One of those limits is harness sensitivity: a large share of an agent's run can be spent on grep, ls, and find calls instead of reasoning about the code, so harness quality matters as much as model quality.
Real-World Impact: Zero-Days and CVEs
Here is the detail that turns CyberGym from an academic exercise into a live discussion for security teams: building the benchmark itself produced real findings. The UC Berkeley team reports that evaluation runs discovered 34 previously-unknown vulnerabilities and 18 incomplete patches in the underlying open-source projects. At the paper's June 2025 snapshot, 4 CVEs had been assigned and 10 of the issues were patched by upstream maintainers; current disclosure and patch counts will be higher.
A benchmark that discovers zero-days while it is being built is a benchmark with teeth. It also raises an ethical axis the paper acknowledges: any agent that can score well on CyberGym can, by construction, find real bugs in real software. That is the intended signal. It is also why Anthropic ties Mythos capability disclosures to an AI Safety Level 3 (ASL-3) classification and extended deployment review.
For defenders: The short-term implication is that fuzzing corpora and OSS-Fuzz-style infrastructure are now also capability elicitation harnesses. If your organization ships C/C++ code, assume that at least some adversaries will run CyberGym-style agents against your public repositories. Prioritize memory-safety hardening (ASan in CI, fuzz integration, migration to memory-safe languages where feasible) accordingly. See the AI governance hub for the policy-layer implications.
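As a minimal sketch of the "ASan in CI" idea, the script below replays a checked-in seed corpus against a fuzz target already built with clang's -fsanitize=address,fuzzer flags and fails the job on any sanitizer report. The binary and corpus paths are placeholders, not part of any real project layout.

```python
import pathlib
import subprocess
import sys

# Placeholder paths: an ASan+libFuzzer-instrumented binary and its seed corpus.
# Built with something like: clang -g -O1 -fsanitize=address,fuzzer target.c -o parser_fuzzer
FUZZ_TARGET = "./out/parser_fuzzer"
CORPUS_DIR = pathlib.Path("corpus/parser")

def main() -> int:
    failures = []
    for seed in sorted(CORPUS_DIR.iterdir()):
        # libFuzzer binaries replay a single input when passed a file path.
        result = subprocess.run(
            [FUZZ_TARGET, str(seed)], capture_output=True, text=True, timeout=60,
        )
        if "ERROR: AddressSanitizer" in result.stderr:
            failures.append(seed.name)
    if failures:
        print("ASan regressions on corpus seeds:", ", ".join(failures))
        return 1  # fail the CI job
    return 0

if __name__ == "__main__":
    sys.exit(main())
```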
Who Runs It: Agent Frameworks
The paper evaluates the frontier model lineup wrapped in four agent frameworks. Each has different assumptions about planning, tool use, and file-system interaction, and those assumptions change the score more than you'd expect. The reference agents, per the paper:
- OpenHands — open-source general-purpose software-engineering agent (formerly OpenDevin). Handles planning, file editing, and shell execution.
- OpenAI Codex CLI — OpenAI's command-line coding agent. Native GPT model support, strong at iterative file-based work.
- EnIGMA — cybersecurity-specialized agent from the SWE-agent lineage, designed for CTF-style tasks.
- Cybench agent — the reference agent from the Cybench benchmark paper, reused here for consistency with prior CTF-adjacent evaluations.
All four ran under an approximate $2-per-task budget cap -- a practical ceiling that limits how much reasoning the agent can do before being cut off. Different agent/model pairings produce materially different numbers; the leaderboard above reflects the paper's best-observed combinations. If you are rolling your own harness, expect to lose several points versus the headline numbers until you've tuned the planner and tool-calling loop.
Who Should Care About CyberGym?
The benchmark matters differently to different roles. Four audiences will get the most value from its numbers -- and need to read those numbers with different caveats.
Where to Find and Run CyberGym
Everything is public. You do not need a waitlist, an API key, or a vendor introduction to reproduce the benchmark yourself.
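A minimal sketch of pulling everything down, assuming the harness installs from the GitHub repository linked above and that the dataset is published on HuggingFace under the same organization name -- the exact dataset ID is an assumption, so confirm it on the HuggingFace page before relying on it:

```python
import subprocess

from datasets import load_dataset  # pip install datasets

# Clone the evaluation harness and agent scaffolding (repository cited earlier).
subprocess.run(
    ["git", "clone", "https://github.com/sunblaze-ucb/cybergym.git"],
    check=True,
)

# Dataset ID assumed from the GitHub org name; check HuggingFace for the real ID.
tasks = load_dataset("sunblaze-ucb/cybergym")
print(tasks)
```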
Video Resources
Video coverage pending editorial review. Walkthroughs for CyberGym methodology, paper-baseline reproduction, and scrutiny of Anthropic's internal 83.1% Mythos number are emerging across the security community. We will add verified video embeds once primary-source recordings (UC Berkeley author talks, Anthropic capability explainers) meet our sourcing threshold. Until then, the CyberGym paper (arXiv 2506.02548), HuggingFace dataset, and Anthropic's Mythos system card are the authoritative written sources.