Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
Source: AI updates on arXiv.org

November 12, 2025, Tech Jacks Solutions

arXiv:2408.09235v3 Announce Type: replace-cross
Abstract: The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.
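
To make the contrast in the abstract concrete, here is a minimal Python sketch, not the paper's implementation. It computes EM and token-level F1 in the common SQuAD-style way (lowercasing, stripping punctuation and English articles), then combines several judge verdicts. The abstract does not say how verdicts are combined, so the strict majority vote in `majority_verdict` is an assumption, and the wording of `build_judge_prompt` is a hypothetical template. No real LLM API is called; the judge verdicts are plain booleans supplied by the caller.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> bool:
    """EM: the normalized token sequences must be identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over the bag-of-words overlap of prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Hypothetical reference-guided prompt; the wording is an assumption."""
    return (
        "You are grading an answer against a reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def majority_verdict(verdicts: list[bool]) -> bool:
    """Assumed aggregation rule: correct only if a strict majority of judges agree."""
    return sum(verdicts) > len(verdicts) / 2


if __name__ == "__main__":
    question = "What is the capital of France?"
    reference = "Paris"
    candidate = "The capital of France is Paris."

    # A verbose but correct answer fails EM and scores low on F1.
    print(exact_match(candidate, reference))
    print(round(token_f1(candidate, reference), 3))

    # Verdicts would come from sending build_judge_prompt(...) to each judge model;
    # they are hard-coded here purely for illustration.
    print(majority_verdict([True, True, False]))
```

In this sketch a correct free-form answer is penalized by both lexical metrics because it contains extra tokens, which is the kind of case where a judge that reads the question, reference, and candidate together can still return a correct verdict.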
