Hurdles to Progress in Long-form Question Answering

The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer. While many models have recently been proposed for LFQA, we show in this paper that the task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress. To demonstrate these challenges, we first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset. While our system tops the public leaderboard, a detailed analysis reveals several troubling trends: (1) our system's generated answers are not actually grounded in the documents that it retrieves; (2) ELI5 contains significant train / test overlap, as at least 81% of ELI5 validation questions occur in paraphrased form in the training set; (3) ROUGE-L is not an informative metric of generated answer quality and can be easily gamed; and (4) human evaluations used for other text generation tasks are unreliable for LFQA. We provide suggestions to mitigate each of these issues, which we hope will lead to more rigorous LFQA research and meaningful progress in the future.
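To make point (3) concrete, here is a minimal sketch of how a ROUGE-L style score is computed from the longest common subsequence (LCS) of tokens. It is not the evaluation code used in the paper: the whitespace tokenization, the lowercasing, the balanced F1 weighting, and the function names `lcs_length` and `rouge_l_f1` are all assumptions made for illustration, and the two example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L style F1 score (assumed simplification: whitespace tokens, beta = 1)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example: the candidate gives a different explanation from the
# reference but shares most of its wording in the same order.
reference = "the sky is blue because air scatters blue light"
candidate = "the sky is blue because the ocean reflects blue light"
score = rouge_l_f1(candidate, reference)
```

Because the score depends only on shared in-order tokens, a candidate that reuses the reference's phrasing can score well even when it says something different, which is one way to see why a surface-overlap metric can be uninformative about answer quality.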