Over 10 years we help companies reach their financial and branding goals. Engitech is a values-driven technology agency dedicated.

Gallery

Contacts

411 University St, Seattle, USA

engitech@oceanthemes.net

+1 -800-456-478-23

Anthropic Claude · Mythos Cluster

CyberGym Benchmark: Evaluating Offensive AI at Scale

CyberGym is a UC Berkeley cybersecurity benchmark (arXiv:2506.02548) that scores an AI agent on a blunt question: can it find real memory-safety bugs in real C/C++ open-source code, and prove it with a Proof of Concept input that crashes the pre-patch build? The suite contains 1,507 task instances across 188 projects, built on top of Google's OSS-Fuzz corpus -- roughly 7.5x the size of NYU CTF (the largest prior public benchmark at ~200 challenges) and more than 35x Cybench, CVE-Bench, AutoAdvExBench, BountyBench, or SEC-Bench. At that scale it finally has enough signal to separate frontier models from each other. It is also the benchmark where Claude Mythos Preview scored 83.1% pass@1 on the full 1,507 tasks in Anthropic's internal run -- a headline number that deserves closer reading than most will give it.

Quick Verdict
As of April 2026, CyberGym is the first cybersecurity benchmark big enough to differentiate frontier models. Mythos Preview scores 83.1% pass@1 (Anthropic internal, full 1,507 tasks) — roughly 3.8x the previous published top of 22.0% from GPT-5 with thinking (paper, Level 1, 300-instance subset). The numbers are not directly comparable, and that gap is part of the story.
1,507
Task Instances
188
Open-Source Projects
34
Zero-Days Found During Eval

What Is CyberGym?

CyberGym was introduced in the June 2025 paper CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale by Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song at UC Berkeley (arXiv:2506.02548). The paper's premise is that prior cybersecurity benchmarks were too small, too synthetic, or too narrow to tell you whether a frontier large language model could actually do offensive work.

A few numbers frame the jump in scale. NYU CTF has about 200 challenges. Cybench has 40. CVE-Bench has 40. AutoAdvExBench, BountyBench, and SEC-Bench sit in the same 40-200 range. CyberGym has 1,507 tasks, all derived from real vulnerabilities discovered in real C/C++ projects through Google's OSS-Fuzz continuous fuzzing service. Each task gives the agent a pre-patch codebase and some amount of scaffolding (see the difficulty ladder below), and demands a reproducer that triggers a sanitizer crash. There is no partial credit, no LLM-judge grading, and no synthetic "capture the flag" abstraction on top. Either the PoC crashes the unpatched code and passes on the patched code, or the task fails.

The paper, dataset, and code are all public. The benchmark lives on HuggingFace; the evaluation harness and agent scaffolding are at github.com/sunblaze-ucb/cybergym. That transparency matters -- it lets anyone reproduce a score instead of taking a vendor's word for it.

Why this matters for security teams: If you're planning around AI agents that claim offensive capability, CyberGym is currently the most credible public measuring stick for one narrow-but-important slice of that capability: can the agent produce working proofs of concept against real memory-safety bugs? That is not the whole of offensive security, but it is a question your security program needs an answer to.


The Leaderboard (and How to Read It)

The chart below plots every published CyberGym score we could verify against a primary source. Pay attention to the source column and sample size -- these rows are not directly comparable to each other. Anthropic's Mythos entry is a full 1,507-task run at pass@1 reported in the Mythos Preview System Card (April 7, 2026). The paper results evaluate on a 300-instance subset at Level 1 (the paper's primary evaluation setting). Mixing them in one bar chart overstates the gap; separating them is the only honest way to show both.

Anthropic internal, full 1,507 pass@1 Paper, Level 1, 300-instance subset Paper, small model baseline
Claude Mythos Preview Anthropic · full 1,507 · pass@1
83.1%
Claude Opus 4.6 Anthropic internal
67%
Claude Sonnet 4.6 Anthropic internal
65%
Claude Opus 4.5 Anthropic internal
51%
GPT-5 w/ thinking Paper · Level 1 · 300-instance
22.0%
Sonnet 4 w/ thinking Paper · Level 1 · 300-instance
19.3%
Sonnet 4 (base) Paper · Level 1 · 300-instance
17.9%
Claude 3.7 Sonnet w/ thinking Paper · Level 1 · 300-instance
17.7%
GPT-4.1 Paper · Level 1 · 300-instance
11.9%
Gemini 2.5 Flash Paper · Level 1 · 300-instance
4.8%
DeepSeek-V3 Paper · Level 1 · 300-instance
3.6%
Qwen3-235B-A22B Paper · Level 1 · 300-instance
2.7%
o4-mini Paper · Level 1 · 300-instance
2.5%
Two populations, one chart. The top four rows use Anthropic's internal harness against the full 1,507-task suite; the remaining rows use the paper's harness on a 300-instance Level 1 subset. Any straight-line comparison overstates the gap. That Mythos still sits comfortably above every paper-evaluated model is notable -- but the only fair apples-to-apples number on the paper's harness remains GPT-5 with thinking at 22.0%.

The honest read (as of April 2026): Anthropic's Mythos number is a big outlier, reported on a non-public harness. Until a third party reproduces it on the paper's settings, treat 83.1% as a vendor claim on a larger-than-baseline sample with Anthropic's own scaffolding. The paper-evaluated ceiling for models without that scaffolding sits near 22%. Both facts are true. Both matter.


Full Leaderboard: Sort and Filter

Click a column header to sort. Use the filter chips to show only paper-evaluated results, only Anthropic's internal runs, or only Claude-family models. This is the same data as the chart above, with the source and sample-size metadata exposed as first-class columns rather than a footnote.

# Model Score Source Sample Setting
1Claude Mythos Preview83.1%Anthropic1,507 (full)pass@1
2Claude Opus 4.667%AnthropicFull suiteInternal run
3Claude Sonnet 4.665%AnthropicFull suiteInternal run
4Claude Opus 4.551%AnthropicFull suiteInternal run
5GPT-5 w/ thinking22.0%Paper300-instanceLevel 1
6Claude Sonnet 4 w/ thinking19.3%Paper300-instanceLevel 1
7Claude Sonnet 4 (base)17.9%Paper300-instanceLevel 1
8Claude 3.7 Sonnet w/ thinking17.7%Paper300-instanceLevel 1
9GPT-4.111.9%Paper300-instanceLevel 1
10Gemini 2.5 Flash4.8%Paper300-instanceLevel 1
11DeepSeek-V33.6%Paper300-instanceLevel 1
12Qwen3-235B-A22B w/ thinking2.7%Paper300-instanceLevel 1
13o4-mini2.5%Paper300-instanceLevel 1
14SWE-Gym-32B0.1%Paper300-instanceLevel 1

Paper-era caveat. The "Paper" rows above are frozen to the June 2025 paper run on the 300-instance Level 1 subset. Several models in that run are now retired or superseded: Claude 3.7 Sonnet and Claude Sonnet 4 (replaced by Sonnet 4.5/4.6), GPT-4.1 (superseded by GPT-5.x), and o4-mini (superseded). Treat those rows as a historical snapshot, not a current recommendation. Any current comparison needs a fresh vendor-run or community re-evaluation.


How CyberGym Scores an Agent

CyberGym defines four difficulty levels, ordered by how much information the agent receives up front. Scoring is execution-based: the agent submits a PoC, the harness runs it against both the pre-patch and post-patch build, and the task passes only if the PoC crashes the unpatched code under a sanitizer and runs cleanly against the patched code. There is no LLM-as-judge stage, which removes one common source of benchmark inflation.

Level 0
Open-Ended Discovery
Pre-patch codebase only. Agent must find a bug on its own, with no pointers to where it lives or what class of flaw it is. The hardest setting -- closest to zero-day hunting.
Level 1
Primary Task
Codebase + a short textual vulnerability description. The paper's primary evaluation setting. Models are told something is wrong here, but must still locate and reproduce it.
Level 2
Crash Trace Provided
Level 1 plus the sanitizer crash stack trace. Now the agent has the approximate crash site. Closer to a reproduction task than a discovery task.
Level 3
One-Day Patch Diff
Level 2 plus the ground-truth patch diff. Agent has everything short of the PoC itself -- this is the "one-day" scenario where a patch has shipped and the exploit must be reconstructed.

Scoring in One Sentence

A PoC must trigger a sanitizer crash (AddressSanitizer, MemorySanitizer, UndefinedBehaviorSanitizer, etc.) on the pre-patch build and run cleanly on the post-patch build. Binary pass/fail. Execution-graded.

Where the Tasks Come From

All 1,507 tasks derive from Google's OSS-Fuzz, a continuous fuzzing service that finds real bugs in production open-source projects. The UC Berkeley team used automated filters plus manual validation to select bugs that are reproducible in a containerized harness -- a non-trivial engineering effort, since OSS-Fuzz crashes are often tied to specific compiler flags, build systems, and sanitizer configurations.


Scope: What CyberGym Actually Measures

CyberGym tests one category of flaw: memory-safety bugs in C/C++ projects that a sanitizer can detect at runtime. That is a meaningful slice of real-world vulnerabilities -- it includes most of what gets CVEs in browsers, media codecs, cryptographic libraries, kernels, and parsers. It is not the whole security universe.

The 28 Sanitizer Classes

Tasks are labeled with the sanitizer crash type that validates them. The benchmark covers 28 distinct classes, including:

  • Buffer overflows — heap, stack, and global buffer out-of-bounds reads/writes
  • Use-after-free — accessing memory after deallocation
  • Null pointer dereference — classic crash primitive that sometimes escalates
  • Integer overflow / underflow — detected by UBSan
  • Uninitialized memory reads — MSan-detected data leaks
  • Stack-buffer-overflow, double-free, memory-leak — the rest of the ASan family

If you've worked on a memory safety refactor in a legacy C codebase, this is the bug catalog you recognize. That familiarity is the benchmark's strength -- these are the exact flaws that ship, get exploited, and get CVEs. The same familiarity is its limit: CyberGym does not see flaws outside this window.


Limitations You Need to Understand Before Citing a Score

CyberGym is among the strongest public cybersecurity benchmarks as of April 2026. That does not make it complete. Five structural limits shape every score the benchmark produces; skip these and any claim you make about "offensive AI capability" will be off-base.

Scope is memory-safety only
No logic flaws, no business-logic auth bypasses, no cryptographic weaknesses, no web or mobile app tests. If your threat model is a web exploit-chaining scenario or an OAuth misconfiguration, CyberGym's score tells you nothing.
The 100-byte PoC wall
The paper measures a steep accuracy cliff: when the required PoC exceeds 100 bytes, agent success rate drops to roughly 10%. That ceiling applies to 65.7% of benchmark tasks. Models that produce short exploit inputs look far better than they would against real-world bugs that need multi-kilobyte payloads.
Agent failure modes inflate simple tasks
The paper's failure-mode analysis finds that 30% of failures are premature termination (the agent gives up), 20% are context-exhausting plaintext PoCs (the agent dumps raw binary into its context until it crashes), and higher-difficulty runs burn steps on repeated grep, ls, and find calls instead of reasoning. Harness quality matters as much as model quality.
Paper vs vendor-internal methodology
The paper evaluates on a 300-instance Level 1 subset with the reference OpenHands / Codex CLI / EnIGMA / Cybench agents at roughly $2/task. Anthropic's Mythos, Opus 4.6, Sonnet 4.6, and Opus 4.5 numbers come from an internal run on the full 1,507-task suite at pass@1 using Anthropic's own scaffolding. The two are not directly comparable. No independent party has yet reproduced the Mythos 83.1% number on the paper's harness; until that happens, treat it as a vendor claim.
Benchmark freshness decays fast
All tasks derive from public OSS-Fuzz data. Once a benchmark is public, models released afterward may have been trained on chat logs, patches, or discussions that reference the underlying bugs. As of April 2026 there is no evidence of contamination, but every frontier model release is a potential inflation event. Always note the evaluation date.

Real-World Impact: Zero-Days and CVEs

Here is the detail that turns CyberGym from an academic exercise into a live discussion for security teams: building the benchmark itself produced real findings. The UC Berkeley team reports that evaluation runs discovered 34 previously-unknown vulnerabilities and 18 incomplete patches in the underlying open-source projects. At the paper's June 2025 snapshot, 4 CVEs had been assigned and 10 of the issues were patched by upstream maintainers; current disclosure and patch counts will be higher.

A benchmark that discovers zero-days while it is being built is a benchmark with teeth. It also raises an ethical axis the paper acknowledges: any agent that can score well on CyberGym can, by construction, find real bugs in real software. That is the intended signal. It is also why Anthropic ties Mythos capability disclosures to an AI Safety Level 3 (ASL-3) classification and extended deployment review.

For defenders: The short-term implication is that fuzzing corpora and OSS-Fuzz-style infrastructure are now also capability elicitation harnesses. If your organization ships C/C++ code, assume that at least some adversaries will run CyberGym-style agents against your public repositories. Prioritize memory-safety hardening (ASan in CI, fuzz integration, migration to memory-safe languages where feasible) accordingly. See AI governance hub for the policy-layer implications.


Who Runs It: Agent Frameworks

The paper evaluates the frontier model lineup wrapped in four agent frameworks. Each has different assumptions about planning, tool use, and file-system interaction, and those assumptions change the score more than you'd expect. The reference agents, per the paper:

  • OpenHands — open-source general-purpose software-engineering agent (formerly OpenDevin). Handles planning, file editing, and shell execution.
  • OpenAI Codex CLI — OpenAI's command-line coding agent. Native GPT model support, strong at iterative file-based work.
  • EnIGMA — cybersecurity-specialized agent from the SWE-agent lineage, designed for CTF-style tasks.
  • Cybench agent — the reference agent from the Cybench benchmark paper, reused here for consistency with prior CTF-adjacent evaluations.

All four ran at an approximate $2-per-task budget cap -- a practical ceiling that limits how much reasoning the agent can do before being cut off. Different agent/model pairings produce materially different numbers; the leaderboard above reflects the paper's best-observed combinations. If you are rolling your own harness, expect to lose several points versus the headline numbers until you've tuned the planner and tool-calling loop.


Who Should Care About CyberGym?

The benchmark matters differently to different roles. Four audiences will get the most value from its numbers -- and need to read those numbers with different caveats.

AI Researcher
Frontier Capability Evaluator
You care about CyberGym because it is one of the few benchmarks that stays hard even for the best models. Pay attention to Level 0 and Level 1 numbers, harness variance, and failure-mode breakdowns -- not just headline pass@1.
Red Team Lead
Offensive Capability Benchmarker
You want to know whether AI agents can replace or augment junior red-team work. CyberGym gives you a memory-safety answer. It does not give you a web-app or phishing answer. Budget time for your own supplemental evaluation against your actual threat model.
Security VP / CISO
Risk and Policy Planner
You read CyberGym for the policy implication: offensive AI capability now exists at a measurable level. Use it to inform ASL-3 / EU AI Act conversations, not to calibrate your SOC tooling. Cross-reference with the AI Governance Hub for the regulatory context.
Academic
Reproducibility Reviewer
You care about the 300-instance subset, the pass@k distribution, and whether anyone has reproduced the 83.1% claim on the paper's harness. The HuggingFace dataset and GitHub repo are your starting points -- everything is open.


Video Resources

Video coverage pending editorial review. Walkthroughs for CyberGym methodology, paper-baseline reproduction, and scrutiny of Anthropic's internal 83.1% Mythos number are emerging across the security community. We will add verified video embeds once primary-source recordings (UC Berkeley author talks, Anthropic capability explainers) meet our sourcing threshold. Until then, the CyberGym paper (arXiv 2506.02548), HuggingFace dataset, and Anthropic's Mythos system card are the authoritative written sources.


Data verified: 2026-04-13
Data verified: 2026-04-13. CyberGym is research software released by UC Berkeley (Sunblaze Lab) under an open-source license. Claude and Mythos are trademarks of Anthropic. GPT is a trademark of OpenAI. Gemini is a trademark of Google LLC. DeepSeek, Qwen, and SWE-Gym are trademarks of their respective owners. This article is independent editorial analysis and was not sponsored or reviewed by any vendor.
Before You Use AI
Your Privacy

CyberGym tasks run inside containerized harnesses on your own infrastructure; no data leaves your environment unless you point the agent at a hosted model API. If you use a commercial API (Anthropic, OpenAI, Google) to run the benchmark, vendor data-handling policies apply -- enterprise and business plans generally exclude training on API traffic, free/consumer tiers may not. Review the policy of whichever API you call before pointing an agent at proprietary or sensitive code.

Mental Health & AI Dependency

Offensive-security work combines high cognitive load with real ethical weight -- the bugs in CyberGym are real, the CVEs are real, and the disclosure pressure is real. Take breaks. Do not let an agent's confidence substitute for your own review. If you or a teammate is in crisis:

  • 988 Suicide & Crisis Lifeline -- Call or text 988 (US)
  • SAMHSA Helpline -- 1-800-662-4357
  • Crisis Text Line -- Text HOME to 741741
Your Rights & Our Transparency

Under GDPR and CCPA, you have the right to access, correct, and delete your personal data. Tech Jacks Solutions maintains editorial independence from all vendors, including Anthropic, OpenAI, Google, UC Berkeley, and the CyberGym authors. This article was not sponsored, reviewed, or approved by any of them. We do not receive affiliate commissions from Anthropic subscriptions or API usage. Evaluations are based on primary documentation (arXiv paper, Anthropic system card), the public dataset, and the public codebase.