The Benchmark, the Jailbreak, and the Shutdown: What Epoch AI's Evaluation Reveals About Fable 5's Capabilities

June 14, 2026 Southpasadenan

Tech Jacks Solutions AI News Coverage

On June 12, 2026, Anthropic pulled two frontier models offline within hours of a US government directive. On the same day, Epoch AI published the first independent evaluation of those models' capabilities. Now that Fable 5 and Mythos 5 are gone, Epoch's findings are the closest thing the public has to a neutral account of what the government just removed from the market, and whether its case holds up.

The Three-Day Arc

Fable 5 launched on June 9, 2026. Three days later, it was gone.

The launch itself was unremarkable by frontier lab standards. Anthropic’s announcement
introduced both Fable 5 and Mythos 5, its paired cyber-capability model, with a system
card (arXiv:2605.14153) that reported substantial benchmark improvements. SWE-bench
Verified showed a 32% gain over prior Claude versions, according to Anthropic’s own
documentation. Agents’ Last Exam, evaluated by UC Berkeley’s RDI, came in at 22%.
FrontierMath Tier 4 (v2), the benchmark that resists memorization by generating novel
problems, was listed as pending Epoch AI’s independent review.

That pending review is where this story turns.

Between launch and shutdown, the Technology pillar’s June 9-11 coverage flagged a
straightforward problem: Anthropic’s
benchmark claims were substantially self-reported. The SWE-bench improvement came from
a vendor system card. The UC Berkeley RDI score appeared in Anthropic’s own documentation
rather than a standalone RDI publication. Epoch’s evaluation, the one independent signal
in the stack, hadn’t cleared yet. Developers evaluating Fable 5 for production use were
working from incomplete information.

Then the jailbreak report surfaced. Then the government moved.

—

What Epoch AI Actually Found

Epoch AI published its independent brief on June 12, 2026, the same day as the shutdown.
The timing is coincidental, and it could hardly be worse for any clean assessment of the
models’ capabilities.

According to the Wire’s frontier lab scan, Epoch’s evaluation found Fable 5 reportedly
achieving 88% on FrontierMath Tier 4 (v2). This figure carries partial verification
status: the Epoch brief is confirmed to exist, but the benchmark page URL hadn’t resolved
at time of writing. Until it does, the 88% figure is reported, not independently confirmed
by this publication.

If accurate, 88% on FrontierMath Tier 4 (v2) is a meaningful result. The benchmark is
specifically constructed to resist pattern-matching against training data: Tier 4 problems
are novel enough that memorization doesn’t help. A frontier model genuinely clearing 88%
at that tier would represent real mathematical reasoning capability, not benchmark
engineering.

Epoch also evaluated Mythos 5’s cyber capabilities. The specific findings aren’t confirmed
from available source excerpts. What matters here is that Mythos 5’s capability
profile, specifically the cyber dimension, is what reportedly triggered the security
concern that led to the government’s action. Epoch’s evaluation of exactly that capability
is the data point most relevant to adjudicating whether the government’s response was
proportionate. It’s also the one we can’t confirm yet.

Don’t expect resolution on this quickly. Epoch’s publication and the shutdown happened
within the same news cycle. Even Epoch’s own framing of Mythos 5’s cyber capabilities
may not directly address the specific vulnerability the government cited.

—

The Jailbreak Dispute: Three Accounts

Three accounts of the security event exist. None fully corroborate each other.

The Wall Street Journal reported that Amazon CEO Andy Jassy contacted senior US
administration officials, and those conversations escalated to a Commerce Department
ban. WSJ’s framing describes Jassy’s calls as “a general warning
that quickly escalated into a wide Commerce Department ban.” WSJ does not specifically
name the jailbreak methodology or the team that identified it.

SecurityWeek’s account identifies “an AI hacker”, not an Amazon cybersecurity team, as
the actor who claimed to have achieved a prompt-based jailbreak. SecurityWeek reported that
Anthropic disputes the characterization entirely, stating the demonstrated method “does
not bypass the core safety classifiers that prevent harmful outputs.” A separate source
corroborates that this is Anthropic’s position.

These may be two separate events, an external researcher’s claim and an internal Amazon
assessment, or the same event described differently by reporters working from different
sources. The specific characterization that Fable 5 was used to extract information
usable in cyberattacks is not confirmed from any source that independently verified the
claim. It doesn’t appear in the Reuters confirmation, the WSJ piece, or the SecurityWeek
report with the specificity the Wire’s original item suggested.

The actor discrepancy matters for one concrete reason. If the jailbreak came from an
independent researcher, the government’s response to a prompt-based jailbreak that
Anthropic disputes is one kind of precedent. If it came from Amazon’s internal security
team flagging a supply chain vulnerability in a model they’d invested $13B in, that’s a
different kind of precedent entirely. The government’s legal authority for the directive,
specifically which BIS provision or executive authority it rests on, hasn’t been confirmed
from available sources.

—

Stakeholder Map: Who Has What at Stake

Amazon’s position is structurally unusual. The company holds a cumulative $13B stake in
Anthropic per prior reporting, distributes Fable 5 via Bedrock, and, if WSJ’s account is
accurate, also triggered the government action that suspended the model. That’s not a
conflict of interest in any legal sense, but it’s a stakeholder dynamic that compliance
teams at AWS customers should understand before forming views on what happened.

—

What Developers and Compliance Teams Can Know Now

Developers with Fable 5 in active evaluation or production via Bedrock face a practical
problem: the evaluation window closed before the independent benchmarks cleared.

Here’s what’s actually confirmed:

– Fable 5 launched June 9. Access suspended June 12. The three-day window is a fact.
– The SWE-bench Verified improvement (32%) and ALE score (22%) are from Anthropic’s own system card. They’re vendor-reported figures, not independently reproduced.
– The FrontierMath Tier 4 result (88%) carries an Epoch AI source attribution but hasn’t been confirmed from a resolved URL.
– Amazon Bedrock access is suspended. No re-access timeline has been disclosed.

For compliance teams at organizations with international user bases, the export control
action introduces a governance question that isn’t about Anthropic specifically. It’s about
whether AI models delivered via API, particularly from a US-based provider, now carry
regulatory exposure that didn’t previously exist. This particular directive barred foreign
national access. The mechanism Anthropic described, a global shutdown because nationality
filtering wasn’t technically feasible, suggests that US-based providers may not be able to
enforce nationality-based access controls on frontier models in real time. That’s worth
including in your AI vendor risk assessment framework, regardless of whether Fable 5 comes
back online.
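
For teams that want to express this risk in code as well as in contracts, one option is to avoid hard-wiring a single model into an application. The sketch below is a minimal illustration, not a description of any Bedrock or Anthropic API: the ModelRouter class, the suspend and select methods, and the model identifiers "fable-5" and "fallback-model-a" are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


class ModelUnavailable(Exception):
    """Raised when every configured model is marked as suspended."""


@dataclass
class ModelRouter:
    """Hypothetical helper: picks the first non-suspended model from an ordered list."""

    preferred: list[str]                              # ordered by preference
    suspended: set[str] = field(default_factory=set)  # models currently unavailable

    def suspend(self, model_id: str) -> None:
        # Record that a model can no longer be used (e.g. after a regulatory directive).
        self.suspended.add(model_id)

    def select(self) -> str:
        # Walk the preference list and return the first model still available.
        for model_id in self.preferred:
            if model_id not in self.suspended:
                return model_id
        raise ModelUnavailable("no configured model is currently available")


# Illustrative identifiers only; they do not correspond to real API model names.
router = ModelRouter(preferred=["fable-5", "fallback-model-a"])
router.suspend("fable-5")
print(router.select())
```

The design choice worth noting is that the suspension is data, not code: an operations team can flip a model to suspended without a redeploy, which matters when a suspension arrives within hours.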

For developers evaluating substitutes: wait for the Epoch URL to resolve before setting a
benchmark target. The 88% FrontierMath figure, if confirmed, gives you a calibration point
for what alternatives need to match on advanced mathematical reasoning. Don’t benchmark
against vendor-reported scores alone.
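
One way to keep that discipline is to record where each score came from and refuse to use anything short of an independently confirmed figure as a target. The sketch below is an illustration under stated assumptions: the Provenance labels, the BenchmarkClaim type, and the helper functions are hypothetical names made up for this example, and the scores are simply the figures reported in this article with their current verification status.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    VENDOR_REPORTED = "vendor-reported"
    INDEPENDENT_UNCONFIRMED = "independent, source not yet resolved"
    INDEPENDENT_CONFIRMED = "independent, confirmed"


@dataclass(frozen=True)
class BenchmarkClaim:
    benchmark: str
    score_pct: float
    source: str
    provenance: Provenance


# Figures as reported in this article; the provenance labels are this sketch's own.
CLAIMS = [
    BenchmarkClaim("SWE-bench Verified (gain over prior versions)", 32.0,
                   "Anthropic system card", Provenance.VENDOR_REPORTED),
    BenchmarkClaim("Agents' Last Exam", 22.0,
                   "Anthropic system card", Provenance.VENDOR_REPORTED),
    BenchmarkClaim("FrontierMath Tier 4 (v2)", 88.0,
                   "Epoch AI brief", Provenance.INDEPENDENT_UNCONFIRMED),
]


def calibration_targets(claims):
    """Return only the claims that are independently confirmed."""
    return [c for c in claims if c.provenance is Provenance.INDEPENDENT_CONFIRMED]


def pending(claims):
    """Return claims that should not yet be used as benchmark targets."""
    return [c for c in claims if c.provenance is not Provenance.INDEPENDENT_CONFIRMED]


for claim in pending(CLAIMS):
    print(f"HOLD: {claim.benchmark} ({claim.score_pct}%) is {claim.provenance.value}")
```

With the statuses as they stand at time of writing, calibration_targets returns an empty list, which is the point: none of the three figures is yet a safe target.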

—

TJS Synthesis

The three-day arc from Fable 5’s launch to its suspension compressed what would normally
be a month-long benchmark credibility process into a news cycle. Epoch’s independent
evaluation arrived exactly as the models went offline. That timing is inconvenient for
everyone: for Anthropic, because the independent data that might have supported or
challenged the government’s case is now associated with a shutdown rather than a launch;
for developers, because the evaluation they needed before production decisions arrived
after access ended; and for the government, because the proportionality of its response
remains publicly contested by Anthropic and hasn’t been adjudicated by any court.

The precedent here isn’t primarily about Fable 5. It’s about export control as a real-time
AI governance instrument. The mechanism worked: a directive issued June 12 produced a
global model suspension on June 12. That speed is new. Traditional export controls operate
over procurement timelines. This one operated over API hours.

Compliance teams that haven’t modeled this as a risk in their AI vendor agreements should
do so now. The question isn’t whether this happens again. The question is what triggers it.
