We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to extract complete sets of facts from full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants through a human evaluation study.
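For concreteness, a minimal sketch of how such a fact-based accuracy score could be computed, assuming facts are represented as (subject, relation, object) triples and the metric is the precision of triples extracted from the generated text against those extracted from the source; the `factual_accuracy` helper and the example data are illustrative assumptions, not the exact implementation.

```python
# Sketch: factual accuracy as precision of relation triples extracted
# from generated text against triples extracted from the source text.
# The triple representation and example data are assumptions for illustration.

Fact = tuple[str, str, str]  # (subject, relation, object)

def factual_accuracy(generated_facts: list[Fact], source_facts: list[Fact]) -> float:
    """Fraction of facts in the generated text that are supported by the source."""
    if not generated_facts:
        return 0.0  # no claims made; treat as zero rather than undefined
    source = set(source_facts)
    supported = sum(1 for fact in generated_facts if fact in source)
    return supported / len(generated_facts)

# Hypothetical example: one of the two generated facts is supported.
source = [("Marie Curie", "award", "Nobel Prize in Physics"),
          ("Marie Curie", "field", "physics")]
generated = [("Marie Curie", "award", "Nobel Prize in Physics"),
             ("Marie Curie", "birthplace", "Paris")]
print(factual_accuracy(generated, source))  # 0.5
```

In practice the triples on both sides would come from the trained end-to-end fact extraction models rather than being given, so the score also reflects extractor quality.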