A number that keeps getting cited.
In April 2024, Google and DeepMind researchers published arXiv:2404.18416, “Capabilities of Gemini Models in Medicine.” The paper reported that Med-Gemini achieves 91.1% accuracy on MedQA, a benchmark built around USMLE-style medical licensing exam questions. The paper also reported that Med-Gemini outperforms GPT-4V across seven multimodal benchmarks by an average relative margin of 44.5%. Both figures appeared in press coverage. Both figures are now appearing in documentation for fine-tuned Med-Gemini derivatives on Hugging Face. Both figures originated in a technical report written by the vendor’s own researchers.
That last sentence isn’t an accusation. It’s a structural fact with practical consequences.
Section 1: What 91.1% on MedQA actually means
MedQA tests whether a model can answer the kinds of questions that appear on the United States Medical Licensing Examination. It’s a well-constructed benchmark with a clear methodology and a meaningful baseline: passing-level human performance on the USMLE is generally cited around 60%, and trained physicians score considerably higher. A model achieving 91.1% on this benchmark is doing something real.
What it isn’t doing: treating patients, navigating an EHR system, generating documentation within a specific hospital’s clinical workflow, or handling the ambiguous, multi-step reasoning that real clinical encounters require. The MedQA benchmark measures a specific type of medical knowledge recall and reasoning under controlled conditions. The gap between controlled benchmark conditions and production deployment conditions is real in every AI domain. In healthcare, it carries direct clinical stakes.
The arithmetic in the paper checks out. Google’s prior medical model, Med-PaLM 2, achieved 86.5% on the same MedQA benchmark per a separate Google Research source, so Med-Gemini’s 4.6-percentage-point improvement over that baseline is internally consistent. The comparison to GPT-4V is more complex: the 44.5% figure is an average relative margin across seven multimodal benchmarks with different methodologies, and GPT-4V is OpenAI’s multimodal model from the same general timeframe, not a current-generation comparison. Both numbers, checked against the paper’s own internal data, hold up arithmetically.
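For readers who want to retrace that arithmetic, here is a minimal sketch. The per-benchmark multimodal scores below are hypothetical placeholders, not figures from arXiv:2404.18416, and the sketch assumes the headline margin is the mean of per-benchmark relative improvements over the GPT-4V score; only the calculation pattern is illustrated.

```python
# Sketch: how the two headline numbers are typically computed.
# Multimodal benchmark scores here are hypothetical placeholders, NOT the
# figures reported in the paper; only the arithmetic is illustrated.

# Absolute improvement on MedQA, in percentage points.
med_gemini_medqa = 91.1
med_palm2_medqa = 86.5
print(f"MedQA improvement: {med_gemini_medqa - med_palm2_medqa:.1f} points")  # 4.6

# Average relative margin, assuming it is the mean of per-benchmark
# relative improvements: (med_gemini - gpt4v) / gpt4v.
benchmarks = {
    "benchmark_a": (62.0, 41.0),  # (med_gemini_score, gpt4v_score), placeholders
    "benchmark_b": (75.0, 55.0),
    "benchmark_c": (48.0, 35.0),
}
relative_margins = [(ours - theirs) / theirs for ours, theirs in benchmarks.values()]
avg_relative_margin = 100 * sum(relative_margins) / len(relative_margins)
print(f"Average relative margin: {avg_relative_margin:.1f}%")
```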
The problem isn’t the arithmetic. It’s the independence gap.
Section 2: The verification hierarchy, and where Med-Gemini sits in it
Benchmark evaluation in AI can come from several sources, each carrying different weight. Independent third-party evaluation, from organizations like Epoch AI, academic research groups, or clinical validation studies, represents the highest confidence level. Vendor technical reports represent the lowest confidence level that still counts as formal documentation. Press coverage of vendor technical reports is amplification, not verification.
Med-Gemini’s benchmark data sits at the vendor technical report tier. The arXiv paper is authored by Google and DeepMind researchers. arXiv hosting doesn’t confer independence; it means the paper is publicly accessible, not that its methodology has been independently reproduced. Epoch AI’s Notable AI Models database, updated April 27, 2026, includes compute-context data for the Med-Gemini model family, but Epoch has not published an independent benchmark evaluation of its MedQA performance. That gap remains open.
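One lightweight way a team might operationalize this hierarchy is to tag every benchmark claim with its evidence tier before it lands in a comparison spreadsheet. The sketch below is illustrative, not a standard taxonomy; the class names and tier labels are assumptions introduced here for the example.

```python
from dataclasses import dataclass
from enum import IntEnum

class EvidenceTier(IntEnum):
    # Higher value = higher confidence; ordering mirrors the hierarchy above.
    PRESS_COVERAGE = 0       # amplification, not verification
    VENDOR_TECH_REPORT = 1   # lowest tier that counts as formal documentation
    INDEPENDENT_EVAL = 2     # third-party reproduction or clinical validation

@dataclass
class BenchmarkClaim:
    model: str
    benchmark: str
    score: float
    source: str
    tier: EvidenceTier

# Example: the Med-Gemini MedQA figure, tagged at the tier it actually occupies.
claim = BenchmarkClaim(
    model="Med-Gemini",
    benchmark="MedQA",
    score=91.1,
    source="arXiv:2404.18416 (vendor-authored)",
    tier=EvidenceTier.VENDOR_TECH_REPORT,
)
```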
This isn’t unusual. Most medical AI benchmark claims live at the vendor technical report tier because independent clinical AI evaluation is genuinely difficult to conduct. Medical benchmark reproducibility requires access to the same data splits, the same evaluation infrastructure, and clinical domain expertise to assess whether the benchmark questions map to real clinical utility. The ecosystem of independent medical AI evaluators is thin compared to what exists for general-purpose models.
That structural gap matters when enterprise healthcare teams are trying to compare medical AI systems across vendors. OpenAI’s GPT-Rosalind, Google’s Med-Gemini, and clinical AI platforms like Abridge are each optimizing for different things and reporting their performance through different methodologies. Without a neutral evaluation framework (an equivalent of LMSYS Chatbot Arena for medical AI), buyers are comparing vendor claims against vendor claims.
Section 3: The 2026 derivative wave, what it does and doesn’t tell you
As of late April 2026, fine-tuned derivatives of Med-Gemini are appearing on Hugging Face leaderboards. This is a meaningful signal about the model’s real-world adoption: researchers and developers found the architecture useful enough to build on. Derivative model activity is one of the more honest signals of a model’s practical value, because it reflects choices made by people working outside the vendor’s organization.
What it doesn’t tell you: whether those derivatives carry the benchmark performance of the original model. A fine-tuned model inherits architecture, not evaluation results. When a derivative model’s documentation references Med-Gemini’s 91.1% MedQA score, that number describes the base model’s performance on a specific benchmark under specific conditions, not the derivative’s performance in any deployment context. The benchmark lineage travels with the model description. The evaluation rigor doesn’t.
Enterprise healthcare teams evaluating any model in the Med-Gemini family need to distinguish three separate questions: (1) What did the base model demonstrate on controlled benchmarks? (2) What has the specific fine-tuned variant been evaluated on, by whom? (3) What does internal testing in your specific clinical environment show?
The answer to question 1 is documented in arXiv:2404.18416, with the vendor-origin caveat attached. The answers to questions 2 and 3 will vary by derivative and deployment.
Section 4: Why independent medical AI evaluation is structurally hard
Building an independent benchmark for medical AI is harder than building one for general-purpose models for three reasons.
Data availability: clinical evaluation datasets are often protected by HIPAA or institutional data agreements that prevent broad third-party access. The MedQA dataset is public, which is why it’s used so frequently, but public benchmarks are also optimized against more aggressively, which reduces their signal quality over time.
Domain expertise requirements: assessing whether a model’s clinical reasoning is actually correct (as opposed to producing plausible-sounding text) requires clinical expertise that general AI evaluation organizations don’t have in house. Epoch AI’s model evaluation work focuses primarily on compute, training data, and general capability benchmarks, not clinical validation.
Evaluation fragmentation: medical AI applications span radiology, pathology, primary care, pharmacology, and dozens of other subdisciplines. A benchmark that is rigorous for oncology imaging interpretation may say nothing about medication interaction flagging. No single benchmark or evaluation framework covers the medical AI space the way MMLU or HumanEval cover general language and coding capability.
This isn’t an argument against using medical AI. It’s an argument for structuring evaluation appropriately, which means not outsourcing that evaluation entirely to vendor-published benchmark scores.
Section 5: What enterprise healthcare buyers should actually ask
If you’re evaluating medical AI for a clinical or operational deployment, the benchmark score in the technical paper is a starting point, not a decision point. Five questions that get closer to a useful answer:
What is the benchmark testing, exactly?
MedQA tests USMLE-style questions. If your deployment involves radiology report generation, medication reconciliation, or clinical documentation, MedQA tells you something about the model’s general medical knowledge, and much less about your specific use case.
Who ran the evaluation?
Vendor technical report, independent third-party, or internal validation? Each tier carries different confidence weight. Ask vendors for independent evaluation data. If none exists, that’s information.
How recent is the evaluation?
The Med-Gemini report is from 2024. Model families iterate. A 2024 benchmark on an architecture that has since been fine-tuned into dozens of derivatives is an increasingly indirect signal.
What does your clinical environment actually require?
EHR integration, documentation workflow fit, latency under clinical throughput, and clinician usability are evaluation dimensions that no public benchmark covers. Internal piloting with defined clinical outcome metrics is the only way to get this data.
What is the regulatory status?
Medical AI deployments in most jurisdictions require regulatory review that goes beyond benchmark performance. FDA clearance, CE marking under the EU Medical Device Regulation, and institutional clinical governance all impose evaluation requirements that vendor benchmarks don’t satisfy.
The 91.1% figure is real. The research behind it is serious. And for enterprise healthcare teams making production deployment decisions, it’s the beginning of the evaluation, not the end.