Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e., with numerical scores as the main output. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the outputs of an NLG system. Our approach aims to provide developers with meaningful insights into what improvements can be made to a given NLG system, and consists of two main steps: open-ended per-instance issue analysis, followed by clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, together with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that the instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 of cases, and that LLM-as-a-qualitative-judge is capable of producing error-type reports resembling those composed by human annotators. We also demonstrate in a case study how using LLM-as-a-qualitative-judge can substantially improve NLG system performance. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
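To make the second step concrete, a cumulative clustering pass can be pictured as a greedy loop: each newly discovered issue is either merged into an existing issue-type cluster or opens a new one, with the decision delegated to the judge LLM. The sketch below is a hypothetical illustration under that assumption, not the paper's released implementation (see the linked repository); `call_llm` and all other names are placeholders.

```python
# Illustrative sketch of a cumulative issue-clustering loop.
# `call_llm(prompt) -> str` is an assumed generic LLM wrapper,
# not part of the paper's code; all names here are placeholders.

def cluster_issues(issues, call_llm):
    """Greedily assign each discovered issue to an existing cluster
    or open a new one, as decided by the judge LLM."""
    clusters = []  # each cluster: {"name": str, "issues": [str]}
    for issue in issues:
        listing = "\n".join(
            f"{i}. {c['name']}" for i, c in enumerate(clusters)
        )
        prompt = (
            "Existing issue types:\n" + (listing or "(none)") + "\n\n"
            f"New issue: {issue}\n"
            "Reply with the number of a matching issue type, "
            "or a short name for a new issue type."
        )
        reply = call_llm(prompt).strip()
        if reply.isdigit() and int(reply) < len(clusters):
            # LLM matched the issue to an existing cluster.
            clusters[int(reply)]["issues"].append(issue)
        else:
            # LLM proposed a new issue type; open a new cluster.
            clusters.append({"name": reply, "issues": [issue]})
    # Sort by frequency so the report surfaces the most common issue types.
    return sorted(clusters, key=lambda c: len(c["issues"]), reverse=True)
```

Formatting the returned clusters as "issue type: count, with example instances" then yields the kind of structured report the abstract describes.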