The research workbench concept is worth understanding before the benchmark number. Google DeepMind's AI Co-mathematician isn't a chat assistant for math questions; per Google DeepMind's research blog, it's described as a persistent research environment that manages iterative hypothesis testing, literature synthesis, and proof reviews across sessions. That architectural framing matters: this is positioned as a tool for open-ended mathematical discovery, not single-session problem solving.
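To make that workflow framing concrete, here's a minimal sketch of what session-persistent state could look like. Every name in it (ResearchWorkbench, WorkbenchState, the JSON file) is hypothetical; this illustrates only the workflow shape the blog post describes, not Google DeepMind's actual implementation.

```python
# Hypothetical sketch of a "persistent research environment": hypotheses,
# literature notes, and proof attempts survive across sessions instead of
# resetting per chat turn. None of these names come from Google DeepMind.
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class WorkbenchState:
    hypotheses: list[dict] = field(default_factory=list)   # open/closed conjectures
    literature: list[str] = field(default_factory=list)    # synthesized references
    proof_attempts: list[dict] = field(default_factory=list)


class ResearchWorkbench:
    """Persists state to disk so a later session resumes where the last ended."""

    def __init__(self, path: str = "workbench_state.json"):
        self.path = Path(path)
        if self.path.exists():
            self.state = WorkbenchState(**json.loads(self.path.read_text()))
        else:
            self.state = WorkbenchState()

    def propose(self, statement: str) -> None:
        self.state.hypotheses.append({"statement": statement, "status": "open"})

    def record_attempt(self, hypothesis_idx: int, outcome: str) -> None:
        # Dead ends are recorded, not discarded: iteration needs the history.
        self.state.proof_attempts.append(
            {"hypothesis": hypothesis_idx, "outcome": outcome}
        )

    def save(self) -> None:
        self.path.write_text(json.dumps(asdict(self.state), indent=2))


# Session 1 proposes and saves; a separate process days later resumes the file.
wb = ResearchWorkbench()
wb.propose("Every graph G in family F has chromatic number <= 4")
wb.record_attempt(0, "counterexample search failed up to n=12")
wb.save()
```

The point of the sketch is the save/resume cycle: hypotheses and failed attempts persist between processes, which is the property that distinguishes a workbench from a chat assistant.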
Google DeepMind states the system established proofs for two conjectures currently undergoing detailed human review. That claim is vendor-attributed and hasn't been independently confirmed, but the human-review framing is itself significant: it implies the proofs are substantive enough to warrant expert verification rather than being trivially checkable.
Then there’s the benchmark claim. According to a paper listed on arXiv (2605.11246), the system achieved 48% on FrontierMath Tier 4, which Google DeepMind characterizes as the highest score for any AI system on the benchmark to date. That characterization is Google DeepMind’s own, not an independently adjudicated fact. The paper’s authorship hasn’t been confirmed, which means it’s not yet known whether the evaluation was conducted by Google DeepMind internally or by an independent body.
Disputed Claim
FrontierMath is the benchmark designed by Epoch AI with domain experts specifically to resist memorization: its problems are novel, contest-difficulty, and verifiably correct. Tier 4 is the hardest tier. If 48% on Tier 4 holds up under independent scrutiny, it's a meaningful result; previous frontier models have scored in the single digits on FrontierMath overall. The distance between "score on a hard benchmark" and "independently verified result" is where AI benchmark claims most often collapse.
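For readers unfamiliar with tiered benchmarks, here's a brief sketch of how a per-tier score differs from an overall score. The records below are invented placeholders, not Epoch AI data, and the tallying logic is generic rather than FrontierMath's actual harness.

```python
# Hedged sketch of tallying a tiered benchmark score, assuming each problem
# has a single machine-checkable answer ("verifiably correct"). The results
# list is illustrative only.
from collections import defaultdict

# (tier, model_answer_correct) per problem; values invented for illustration.
results = [(4, True), (4, False), (4, True), (3, False), (3, True)]

by_tier: dict[int, list[bool]] = defaultdict(list)
for tier, correct in results:
    by_tier[tier].append(correct)

for tier, outcomes in sorted(by_tier.items()):
    score = 100 * sum(outcomes) / len(outcomes)
    print(f"Tier {tier}: {score:.0f}% ({sum(outcomes)}/{len(outcomes)})")

# A headline "48% on Tier 4" is a per-tier accuracy like the Tier 4 line
# above, a different quantity from accuracy over all tiers combined.
```

The distinction matters for reading the claim: a Tier 4 score and an overall FrontierMath score are not comparable numbers.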
The GPT-5.4 comparison mentioned in the paper can't be evaluated here: GPT-5.4 couldn't be independently confirmed as a model designation, and the comparison should be treated as unverified until the paper's full contents are reviewed.
The part nobody mentions in benchmark announcements like this: FrontierMath Tier 4 performance doesn't automatically translate to practical mathematical research utility. A system that scores 48% on novel contest problems under benchmark conditions may behave very differently in a real research workflow, where problem formulation, dead ends, and interdisciplinary context are the actual bottlenecks. The architectural features (persistent state, literature synthesis, iterative hypothesis testing) are the more durable claim, because they describe a workflow design rather than a performance number.
What to Watch
Don’t treat this as a confirmed record until arXiv paper authorship is confirmed and Epoch AI releases independent evaluation data. The architectural claims are credible and worth tracking. The 48% figure is a claimed result from an unconfirmed paper, reported by the entity whose system is being evaluated.
TJS synthesis: Watch for two things. First, arXiv paper authorship confirmation: if the evaluation was conducted independently, the 48% number upgrades significantly. Second, whether the human review of the two conjectures produces a verdict. Those two data points will tell you more about this system than the benchmark number alone.