Most AI benchmark announcements lead with what the model achieved. LifeSciBench inverts that. The
editorial number here isn’t GPT-Rosalind’s 36.1% task pass rate; it’s the 63.9% of tasks it didn’t pass. According to OpenAI, 36.1% is the best result any model produced on this evaluation suite. For
organizations evaluating AI for drug discovery, biotech R&D, or clinical research workflows, that
gap is the most honest data point available on where the technology actually stands.
The benchmark details matter for interpreting the number. According to OpenAI’s publication,
LifeSciBench consists of 750 tasks spanning seven core scientific workflows and seven biological
domains, developed with 173 scientist contributors and reviewed by 453 domain experts. OpenAI
states that 53% of tasks require attachment review, meaning this isn’t a text-only evaluation. Models had to process documents, figures, and data files representative of real research
workflows, not synthetic question-answer pairs. That design choice makes the failure rate more
informative than a standard benchmark would be.
The catch is the sourcing. OpenAI developed LifeSciBench, OpenAI ran the evaluations, and OpenAI
published the results. The expert involvement was in benchmark design and review, not in
independently evaluating the models. According to OpenAI’s results, GPT-Rosalind achieved a
36.1% task pass rate and a problem-weighted normalized score of 0.576, outperforming comparison
frontier and domain-specialized models. Those model names and their specific scores appear in
OpenAI’s publication; specific version designations for comparison models can’t be confirmed from
resolved sources and are omitted here. The primary source URL was inaccessible for direct
verification at time of publication. All figures carry OpenAI’s attribution. Epoch AI hasn’t
evaluated GPT-Rosalind on LifeSciBench. None of this is independently benchmarked.
Evidence
That said, the benchmark’s design signals are positive. Expert authorship, domain-expert review,
and attachment-heavy task design are meaningful quality indicators for benchmark validity, even
if they don’t make the performance scores independent. A well-designed benchmark with
vendor-reported scores is meaningfully better than a poorly designed one. The 36.1% pass rate
being the ceiling isn’t a dismissal of the model; it’s evidence that the benchmark is doing
its job.
According to OpenAI, 34.8% of tasks had a best-model pass rate below 20% across all tested
models. The part nobody mentions in AI-for-science coverage: that sub-20% ceiling on roughly a
third of tasks suggests there are categories of life science reasoning that no current frontier
model handles reliably. The problem isn’t that models fail to handle these tasks well; they barely
handle them at all. Drug discovery and biotech buyers should weigh that finding against any
vendor claim of production-readiness on complex research workflows.
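To make those percentages concrete, the short sketch below converts the reported figures into approximate task counts. It assumes each percentage applies to the full 750-task set, which OpenAI's figures as cited here do not state explicitly, and the function name `approx_task_count` is illustrative only. Because the published percentages are themselves rounded, the counts are estimates, not reported values.

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_TASKS = 750  # task count reported by OpenAI


def approx_task_count(total: int, percent: str) -> int:
    """Convert a reported percentage into an approximate whole-task count.

    Assumption: the percentage is taken over all `total` tasks.
    Decimal arithmetic avoids float rounding surprises at .5 boundaries.
    """
    exact = Decimal(total) * Decimal(percent) / Decimal(100)
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Percentages as cited in this article (all attributed to OpenAI).
reported = {
    "passed by GPT-Rosalind": "36.1",
    "not passed by GPT-Rosalind": "63.9",
    "require attachment review": "53",
    "best-model pass rate below 20%": "34.8",
}

for label, pct in reported.items():
    count = approx_task_count(TOTAL_TASKS, pct)
    print(f"{label}: {pct}% of {TOTAL_TASKS} is about {count} tasks")
```

Under that assumption, the top model passed roughly 271 tasks and missed roughly 479, about 398 tasks involved attachments, and about 261 tasks saw no model pass more than one time in five.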
Our June 4 coverage established what GPT-Rosalind can do at the capabilities level. LifeSciBench
answers a different question: how does structured expert evaluation measure AI against real
scientific work? The answer, according to OpenAI, is that even the best model is failing most
of it. GPT-Rosalind is available via restricted trusted-access review; it isn’t a product teams
can deploy today without access authorization.
Unanswered Questions
- What are the specific task categories where best-model pass rates fell below 20%, and which of those map to drug discovery vs. clinical research vs. genomics workflows?
- Will OpenAI release LifeSciBench under terms that allow third-party model evaluation, making it a durable community benchmark?
- What does a 36.1% pass rate translate to in practice for a research team using GPT-Rosalind on a real experimental design workflow?
What to watch
Epoch AI hasn’t yet published an independent evaluation of GPT-Rosalind on
LifeSciBench. When that arrives, the performance scores get a reference point that doesn’t
require trusting a vendor evaluating its own model on a benchmark it designed. The benchmark
itself, if OpenAI releases it under terms that allow third-party model evaluation, becomes the
more durable contribution. A rigorous, expert-designed evaluation suite for scientific AI is
useful regardless of which model tops it.
TJS synthesis
LifeSciBench is a meaningful contribution to how the field evaluates scientific
AI, sourcing caveats aside. The 36.1% ceiling from the top model sets a concrete benchmark for
organizations that have been evaluating AI-for-science claims without a rigorous reference
point. Don’t read the failure rate as evidence that AI isn’t useful in life science contexts; it
isn’t that. Read it as the honest calibration number that vendor capability announcements rarely
provide. When evaluating AI for drug discovery or biotech R&D, hold vendors to this standard:
show task pass rates on LifeSciBench or an equivalent expert-designed evaluation. Vendor demos
on curated examples shouldn’t clear that bar automatically.