CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research
cs.AI updates on arXiv.org

October 15, 2025, Tech Jacks Solutions

arXiv:2510.11985v1 Announce Type: new
Abstract: Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBench, a robust benchmark that tests reasoning capabilities of LMs on scientific publications. CGBench is built from ClinGen, a resource of expert-curated literature interpretations in clinical genetics. CGBench measures the ability to 1) extract relevant experimental results following precise protocols and guidelines, 2) judge the strength of evidence, and 3) categorize and describe the relevant outcome of experiments. We test 8 different LMs and find that while models show promise, substantial gaps exist in literature interpretation, especially on fine-grained instructions. Reasoning models excel in fine-grained tasks, but non-reasoning models are better at high-level interpretations. Finally, we measure LM explanations against human explanations with an LM judge approach, revealing that models often hallucinate or misinterpret results even when correctly classifying evidence. CGBench reveals strengths and weaknesses of LMs for precise interpretation of scientific publications, opening avenues for future research in AI for clinical genetics and science more broadly.
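
To make the second task (judging the strength of evidence) concrete, here is a minimal sketch of how model predictions could be scored against expert-curated labels. The abstract does not say which metrics CGBench uses, so exact-match accuracy and per-label recall are assumptions here. The function name `score_evidence_strength`, the label strings, and the toy data are hypothetical and do not come from the paper or from ClinGen.

```python
from collections import Counter


def score_evidence_strength(predicted, reference):
    """Return (overall accuracy, per-label recall) for two aligned label lists.

    Assumption: exact string match counts as correct. This is an illustrative
    scoring rule, not the metric reported by the CGBench authors.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")

    # Count how often each reference label occurs and how often it was matched.
    totals = Counter(reference)
    hits = Counter(r for p, r in zip(predicted, reference) if p == r)

    accuracy = sum(hits.values()) / len(reference)
    per_label_recall = {label: hits[label] / n for label, n in totals.items()}
    return accuracy, per_label_recall


# Hypothetical toy data: expert labels vs. model outputs for four papers.
reference = ["strong", "moderate", "supporting", "moderate"]
predicted = ["strong", "supporting", "supporting", "moderate"]

accuracy, recall = score_evidence_strength(predicted, reference)
print(accuracy)
print(recall)
```

Per-label recall is included because a single accuracy figure can hide a model that does well on common labels and poorly on rare ones. A score like this also cannot catch the abstract's other finding, that models may hallucinate or misinterpret results while still classifying evidence correctly, since that requires comparing explanations and not labels.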
