Re-TACRED: Addressing Shortcomings of the TACRED Dataset

TACRED is one of the largest and most widely used sentence-level relation extraction datasets. Models evaluated on this dataset consistently set new state-of-the-art results. However, they still exhibit large error rates despite leveraging external knowledge and unsupervised pretraining on large text corpora. A recent study suggested that this may be due to poor dataset quality. The study observed that over 50% of the most challenging sentences from the development and test sets are incorrectly labeled and account for an average drop of 8% F1-score in model performance. However, this study was limited to a small, biased sample of 5k (out of a total of 106k) sentences, substantially restricting the generalizability and broader implications of its findings. In this paper, we address these shortcomings by: (i) performing a comprehensive study over the whole TACRED dataset, (ii) proposing an improved crowdsourcing strategy and deploying it to re-annotate the whole dataset, and (iii) performing a thorough analysis to understand how correcting the TACRED annotations affects previously published results. After verification, we observed that 23.9% of TACRED labels are incorrect. Moreover, evaluating several models on our revised dataset yields an average F1-score improvement of 14.3% and helps uncover significant relationships between the different models (rather than simply offsetting or scaling their scores by a constant factor). Finally, in addition to our analysis, we release Re-TACRED, a new, completely re-annotated version of the TACRED dataset that can be used to perform reliable evaluation of relation extraction models.
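The two headline quantities in the abstract (the share of labels that change after re-annotation, and the F1-score of a model under the original versus the revised labels) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `label_change_rate` and `micro_f1` are hypothetical, the four-sentence toy data is invented, and the choice to exclude the negative class (assumed to be named `no_relation`) from micro-averaged F1 follows a convention commonly used for relation extraction, which may differ from the exact scoring used in the paper.

```python
from typing import Sequence

# Assumed name of the negative ("no relation holds") class.
NO_RELATION = "no_relation"


def label_change_rate(original: Sequence[str], revised: Sequence[str]) -> float:
    """Fraction of examples whose label differs between two annotation passes."""
    if len(original) != len(revised):
        raise ValueError("label lists must have the same length")
    if not original:
        raise ValueError("label lists must not be empty")
    changed = sum(o != r for o, r in zip(original, revised))
    return changed / len(original)


def micro_f1(gold: Sequence[str], pred: Sequence[str],
             negative: str = NO_RELATION) -> float:
    """Micro-averaged F1 over all classes except the negative class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # A prediction counts as correct only if it matches and is not the negative class.
    correct = sum(g == p and p != negative for g, p in zip(gold, pred))
    predicted = sum(p != negative for p in pred)
    actual = sum(g != negative for g in gold)
    if correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / actual
    return 2 * precision * recall / (precision + recall)


# Invented toy data: the same four sentences under two label sets,
# plus one hypothetical model's predictions.
original = ["per:title", "no_relation", "org:founded_by", "per:title"]
revised = ["per:title", "per:employee_of", "org:founded_by", "no_relation"]
predictions = ["per:title", "per:employee_of", "org:founded_by", "no_relation"]

print(f"labels changed: {label_change_rate(original, revised):.1%}")
print(f"F1 vs. original labels: {micro_f1(original, predictions):.3f}")
print(f"F1 vs. revised labels:  {micro_f1(revised, predictions):.3f}")
```

In this toy case the same predictions score higher against the revised labels than against the original ones, which is the kind of shift the abstract reports in aggregate. The numbers produced here say nothing about the real datasets.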
