AI Models News: OpenAI's LifeSciBench Shows Best Model Fails 64% of Expert-Designed Life Science Tasks

June 18, 2026 · 3 min read · OpenAI · Qualified · Strong

Tech Jacks Solutions AI News Coverage

OpenAI published LifeSciBench, a new evaluation suite for AI performance on life science research workflows, and according to OpenAI's own results, the top-performing model, GPT-Rosalind, passed fewer than 4 in 10 tasks. The benchmark is designed to be hard; that failure rate is the point.


GPT-Rosalind task pass rate, 36.1% (OpenAI)

Key Takeaways

According to OpenAI, GPT-Rosalind achieved a 36.1% task pass rate on LifeSciBench, the highest of all tested models, meaning the best available AI failed 64% of expert-designed life science tasks.
All performance figures are OpenAI-reported from a benchmark OpenAI designed and ran; Epoch AI has not independently evaluated GPT-Rosalind on LifeSciBench. Primary source URL was inaccessible for direct verification.
According to OpenAI, 34.8% of tasks had a best-model pass rate below 20%, suggesting entire categories of life science reasoning that no current frontier model handles reliably.
LifeSciBench's expert-authored, attachment-heavy design makes the failure rate more informative than standard benchmarks; treat it as a floor-setter, not a product review.

Model Release

GPT-Rosalind via LifeSciBench

Organization: OpenAI

Type: LLM (Flagship)

Parameters: Not disclosed

Benchmark: [SELF-REPORTED] LifeSciBench: 36.1% task pass rate, 0.576 weighted normalized score (OpenAI-reported; primary source inaccessible for verification)

Availability: Restricted trusted-access review

Verification

Qualified: OpenAI Research Index, single vendor source, primary URL broken at publication.
All figures are OpenAI-reported. Expert involvement was in benchmark design, not independent model evaluation. Epoch AI evaluation pending. Treat all scores as self-reported until independent evaluation is published.

Most AI benchmark announcements lead with what the model achieved. LifeSciBench inverts that. The
number that matters here isn’t GPT-Rosalind’s 36.1% task pass rate; it’s the 63.9% it didn’t pass.
According to OpenAI, that’s the best result any model produced on this evaluation suite. For
organizations evaluating AI for drug discovery, biotech R&D, or clinical research workflows, that
gap is the most honest data point available on where the technology actually stands.

The benchmark details matter for interpreting the number. According to OpenAI’s publication,
LifeSciBench consists of 750 tasks spanning seven core scientific workflows and seven biological
domains, developed with 173 scientist contributors and reviewed by 453 domain experts. OpenAI
states that 53% of tasks require attachment review, meaning this isn’t a text-only evaluation. Models had to process documents, figures, and data files representative of real research
workflows, not synthetic question-answer pairs. That design choice makes the failure rate more
informative than a standard benchmark would be.

The catch is the sourcing. OpenAI developed LifeSciBench, OpenAI ran the evaluations, and OpenAI
published the results. The expert involvement was in benchmark design and review, not in
independently evaluating the models. According to OpenAI’s results, GPT-Rosalind achieved a
36.1% task pass rate and a problem-weighted normalized score of 0.576, outperforming comparison
frontier and domain-specialized models. OpenAI’s publication names those comparison models and
gives their scores, but their specific version designations couldn’t be confirmed from the sources
we could access, so they are omitted here. The primary source URL was inaccessible for direct
verification at time of publication. All figures carry OpenAI’s attribution. Epoch AI hasn’t
evaluated GPT-Rosalind on LifeSciBench. None of this is independently benchmarked.

Evidence

GPT-Rosalind achieves 36.1% task pass rate on LifeSciBench, outperforming all tested frontier and domain-specialized models

Single vendor source (OpenAI evaluating own model on own benchmark). Primary URL inaccessible. No independent replication. Epoch AI evaluation pending.

That said, the benchmark’s design signals are positive. Expert authorship, domain-expert review,
and attachment-heavy task design are meaningful quality indicators for benchmark validity, even
if they don’t make the performance scores independent. A well-designed benchmark with
vendor-reported scores is meaningfully better than a poorly designed one. The 36.1% pass rate
being the ceiling isn’t a dismissal of the model; it’s evidence that the benchmark is doing
its job.

According to OpenAI, 34.8% of tasks had a best-model pass rate below 20% across all tested
models. The part that rarely gets mentioned in AI-for-science coverage: that sub-20% ceiling on
roughly a third of tasks suggests there are categories of life science reasoning that no current
frontier model handles reliably. The question for those tasks isn’t whether models handle them
well; it’s whether they handle them at all. Drug discovery and biotech buyers should treat that
finding as a floor-setter for any vendor claiming production-readiness on complex research
workflows.
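
The percentages above can be turned into rough task counts. The following is a minimal sketch, assuming each OpenAI-reported percentage is a simple share of the 750 tasks; the source does not confirm how pass rates were aggregated (for example, across repeated runs), so the counts are illustrative only. The helper `implied_count` and the constant names are hypothetical, introduced here for illustration.

```python
from decimal import Decimal

# Figures as reported by OpenAI and quoted in this article.
TOTAL_TASKS = 750
PASS_RATE_PCT = Decimal("36.1")         # GPT-Rosalind task pass rate
HARD_TASK_SHARE_PCT = Decimal("34.8")   # tasks with best-model pass rate below 20%
ATTACHMENT_SHARE_PCT = Decimal("53")    # tasks requiring attachment review


def implied_count(share_pct: Decimal, total: int = TOTAL_TASKS) -> Decimal:
    """Convert a percentage of the task pool into an implied task count.

    Assumption (not confirmed by the source): the percentage is a simple
    fraction of all tasks in the benchmark.
    """
    return share_pct * total / 100


# Failure rate is the complement of the reported pass rate.
fail_rate_pct = 100 - PASS_RATE_PCT

print(f"Failure rate: {fail_rate_pct}%")
print(f"Implied tasks passed: {implied_count(PASS_RATE_PCT)}")
print(f"Implied tasks with best-model pass rate < 20%: {implied_count(HARD_TASK_SHARE_PCT)}")
print(f"Implied tasks requiring attachments: {implied_count(ATTACHMENT_SHARE_PCT)}")
```

Under that assumption, 34.8% of 750 comes to 261 tasks on which no tested model reached a 20% pass rate. The implied count of passed tasks (270.75) is not a whole number, a reminder that the reported pass rate may not be a simple per-task tally.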

Our June 4 coverage established what GPT-Rosalind can do at the capabilities level. LifeSciBench
answers a different question: how does structured expert evaluation measure AI against real
scientific work? The answer, according to OpenAI, is that even the best model is failing most
of it. GPT-Rosalind is available via restricted trusted-access review; it isn’t a product teams
can deploy today without access authorization.

Unanswered Questions

What are the specific task categories where best-model pass rates fell below 20%, and which of those map to drug discovery vs. clinical research vs. genomics workflows?
Will OpenAI release LifeSciBench under terms that allow third-party model evaluation, making it a durable community benchmark?
What does a 36.1% pass rate translate to in practice for a research team using GPT-Rosalind on a real experimental design workflow?

What to watch

Epoch AI hasn’t yet published an independent evaluation of GPT-Rosalind on
LifeSciBench. When that arrives, the performance scores get a reference point that doesn’t
require trusting a vendor evaluating its own model on a benchmark it designed. The benchmark
itself, if OpenAI releases it under terms that allow third-party model evaluation, becomes the
more durable contribution. A rigorous, expert-designed evaluation suite for scientific AI is
useful regardless of which model tops it.

TJS synthesis

LifeSciBench is a meaningful contribution to how the field evaluates scientific
AI, design caveats aside. The 36.1% ceiling from the top model sets a concrete benchmark for
organizations that have been evaluating AI-for-science claims without a rigorous reference
point. The failure rate isn’t evidence that AI is useless in life science contexts, so don’t read
it that way. Read it as the honest calibration number that vendor capability announcements rarely
provide. When evaluating AI for drug discovery or biotech R&D, hold vendors to this standard:
show task pass rates on LifeSciBench or an equivalent expert-designed evaluation. Vendor demos
on curated examples shouldn’t clear that bar automatically.