The paper making the rounds isn’t new. arXiv:2404.18416, titled “Capabilities of Gemini Models in Medicine,” was submitted by Google and Google DeepMind researchers in April 2024. What’s new is the downstream activity: as of late April 2026, fine-tuned Med-Gemini derivatives are appearing on Hugging Face model leaderboards in growing numbers, and the Epoch AI Notable AI Models database, updated April 27, 2026, has added compute-context data for the model family. Neither development constitutes an independent benchmark evaluation, but both have renewed practitioner attention to the original numbers.
Those numbers, taken from the technical report: Med-Gemini achieves 91.1% accuracy on MedQA, a USMLE-style question set used as a proxy for clinical reasoning performance. According to the same paper, Google’s prior medical model, Med-PaLM 2, scored 86.5% on the same benchmark, making this an absolute improvement of 4.6 percentage points. The report also states that Med-Gemini outperforms GPT-4V across seven multimodal benchmarks by an average relative margin of 44.5%. These figures come from the vendor’s own technical report. They have been amplified by press coverage and referenced in derivative model documentation, but they have not been reproduced by an independent evaluator.
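Worth noting is that the 4.6-point figure and the 44.5% figure are different kinds of quantities: the first is an absolute gap in percentage points on a single benchmark, the second a mean of per-benchmark relative improvements. A minimal sketch of how an average relative margin is typically computed, using placeholder scores rather than the paper’s actual per-benchmark numbers:

```python
# Minimal sketch of how an "average relative margin" is typically computed.
# All per-benchmark scores below are placeholders, not values from the paper.

def average_relative_margin(model_scores, baseline_scores):
    """Mean of per-benchmark relative improvements, expressed in percent."""
    margins = [(m - b) / b * 100.0 for m, b in zip(model_scores, baseline_scores)]
    return sum(margins) / len(margins)

# Hypothetical accuracies on seven multimodal benchmarks (model vs. baseline).
med_gemini_scores = [0.70, 0.62, 0.55, 0.81, 0.48, 0.66, 0.74]
gpt_4v_scores     = [0.52, 0.40, 0.38, 0.60, 0.31, 0.50, 0.55]

print(f"Average relative margin: {average_relative_margin(med_gemini_scores, gpt_4v_scores):.1f}%")
```

A relative margin rewards large proportional gains on benchmarks where the baseline scores poorly, which is one reason an averaged relative figure and a single absolute gap aren’t directly comparable.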
That distinction, between vendor-reported figures and independently reproduced ones, matters more in medicine than in most AI domains.
Medical AI benchmarks carry downstream weight. When a system’s benchmark performance becomes the basis for procurement decisions, clinical workflow integration, or regulatory filings, the question of who ran the evaluation isn’t procedural; it’s material. The MedQA benchmark tests the kind of medical knowledge a physician licensing exam requires, and achieving 91.1% on it is a meaningful result. What the score doesn’t tell you: how the model performs on the specific patient population, documentation workflow, or EHR system your organization uses. Benchmark conditions and production conditions diverge in any AI deployment; in healthcare, that divergence can carry clinical consequences.
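In practice, that observation tends to cash out as a small in-house evaluation harness: the same multiple-choice scoring logic a MedQA-style benchmark uses, run against questions drawn from your own patient population and documentation workflow. The sketch below assumes a hypothetical `query_model` wrapper around whatever inference endpoint your deployment actually exposes, plus a local JSONL question file; neither is part of any vendor tooling.

```python
import json

def query_model(prompt: str) -> str:
    """Hypothetical wrapper around your deployment's inference endpoint."""
    raise NotImplementedError("Replace with your organization's model call.")

def local_accuracy(eval_path: str) -> float:
    """Multiple-choice accuracy on an in-house question set.
    Each JSONL record: {"question": str, "options": {"A": str, ...}, "answer": "A"}."""
    with open(eval_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    correct = 0
    for rec in records:
        prompt = rec["question"] + "\n" + "\n".join(
            f"{label}. {text}" for label, text in sorted(rec["options"].items())
        )
        # Take the first character of the response as the chosen option label.
        prediction = query_model(prompt).strip()[:1].upper()
        correct += prediction == rec["answer"]
    return correct / len(records)
```

The point isn’t that this replaces MedQA; it’s that a score produced on your own data, under your own workflow, is the number a procurement or integration decision should actually rest on.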
Google DeepMind’s primary announcement page for Med-Gemini was not accessible for verification at publication time, and no independent third-party evaluation of the MedQA or multimodal benchmark results was available. One specific claim in the technical report, expert-level performance on 3D CT scan report generation, could not be confirmed through available cross-references and should be treated as an unverified vendor assertion until a primary or independent source confirms it.
Epoch AI’s benchmark coverage of the model family is still pending. When Epoch publishes independent evaluation data for Med-Gemini or its derivatives, that will represent the first third-party assessment of the model’s performance claims.
The Hugging Face derivative wave is worth watching for a different reason. Fine-tuned variants built on Med-Gemini’s architecture are appearing without the same documentation their parent model has. A derivative model carrying Med-Gemini’s benchmark lineage in its description is not the same as a model that has itself been evaluated, vendor-reported or otherwise. Enterprise healthcare teams evaluating any model in this family should distinguish clearly between the original research paper’s claims and what a specific fine-tuned deployment has been tested against.
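One quick, if coarse, check on a derivative is whether its model card declares evaluation results of its own or only cites the parent’s lineage. Hugging Face cards store self-reported evaluations under the `model-index` metadata key, which the `huggingface_hub` library exposes; the repository id below is a placeholder, not a real model.

```python
from huggingface_hub import ModelCard

def declared_eval_results(repo_id: str) -> list:
    """Evaluation results a model card declares for itself (its 'model-index' block).
    An empty result means none are declared; it does not prove the model is untested."""
    card = ModelCard.load(repo_id)
    return card.data.to_dict().get("model-index", [])

# Placeholder repository id, for illustration only.
if not declared_eval_results("some-org/med-gemini-derivative"):
    print("No self-declared evaluations; treat benchmark lineage in the card as inherited.")
```

An empty `model-index` block doesn’t settle anything on its own, but it tells you the card’s benchmark numbers belong to the parent paper, not to the artifact you’d be deploying.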
TJS synthesis: The 2026 renewal of interest in Med-Gemini’s 2024 benchmarks is a useful case study in how vendor-reported AI performance data travels. A number stated in a technical report becomes a number cited in press coverage, which becomes a number referenced in derivative model documentation, which becomes a number that appears in an enterprise vendor’s procurement pitch. Each step adds distance from the original evaluation methodology. For healthcare technology teams, the practical question isn’t whether 91.1% is impressive (it is) but whether the evaluation that produced it tells you what you need to know about your specific deployment context.