Despite significant interest in developing general-purpose fact-checking models, it is challenging to construct a large-scale fact verification dataset with realistic claims of the kind that occur in the real world. Existing claims are either authored by crowdworkers, introducing subtle biases that are difficult to control for, or manually verified by professional fact-checkers, making them expensive and limited in scale. In this paper, we construct a challenging, realistic, and large-scale fact verification dataset called FaVIQ, using information-seeking questions posed by real users who do not know the answer. The ambiguity in information-seeking questions enables automatically constructing true and false claims that reflect confusions arising from users (e.g., the year a movie was filmed vs. released). Our claims are verified to be natural, contain little lexical bias, and require a complete understanding of the evidence for verification. Our experiments show that state-of-the-art models are far from solving our new task. Moreover, training on our data helps professional fact-checking, outperforming models trained on the most widely used dataset, FEVER, or on in-domain data by up to 17% absolute. Altogether, our data will serve as a challenging benchmark for natural language understanding and support future progress in professional fact checking.
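To make the claim-construction idea concrete, below is a minimal, hypothetical sketch of the pairing step. It assumes we already have an ambiguous question annotated with its disambiguated interpretations and their answers (as in AmbigQA, which FaVIQ builds on): a true claim pairs a disambiguated question with its own answer, while a false claim pairs it with the answer to a different interpretation of the same question. All names here are illustrative, and the template-based converter stands in for the trained QA-to-claim generation model used in the actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Interpretation:
    """One disambiguation of an ambiguous question, plus its answer."""
    question: str  # e.g. "When was the movie filmed"
    answer: str    # e.g. "1998"

def qa_to_claim(question: str, answer: str) -> str:
    """Toy template-based converter; the real pipeline uses a trained
    QA-to-claim generation model rather than this template."""
    return f"The answer to '{question.rstrip('?')}?' is {answer}."

def build_claims(interps: List[Interpretation]) -> List[Tuple[str, str]]:
    """Pair each disambiguated question with its own answer (a true
    claim) and with the answer to a different interpretation of the
    same ambiguous question (a false claim). The mismatched answer
    mirrors a plausible real-user confusion, e.g. mixing up a movie's
    filming year with its release year."""
    claims = []
    for i, interp in enumerate(interps):
        claims.append((qa_to_claim(interp.question, interp.answer), "true"))
        other = interps[(i + 1) % len(interps)]  # a confusable sibling
        if other.answer != interp.answer:
            claims.append((qa_to_claim(interp.question, other.answer), "false"))
    return claims

if __name__ == "__main__":
    # Two interpretations of the ambiguous question
    # "When was the movie made?"
    interps = [
        Interpretation("When was the movie filmed", "1998"),
        Interpretation("When was the movie released", "1999"),
    ]
    for claim, label in build_claims(interps):
        print(f"[{label}] {claim}")
```

Because the false answer is itself a correct answer to a closely related interpretation, the resulting false claims stay lexically similar to the true ones, which is what keeps lexical bias low and forces models to actually consult the evidence.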