As online news has become increasingly popular and fake news increasingly prevalent, the ability to audit the veracity of online news content has become more important than ever. This task is a binary classification problem, for which transformers have achieved state-of-the-art results. Using the publicly available ISOT and Combined Corpus datasets, this study explores transformers' ability to identify fake news, with particular attention to generalisation to unseen datasets with varying styles, topics and class distributions. Moreover, we explore the idea that opinion-based news articles cannot be classified as real or fake because of their subjective nature and often sensationalised language, and we propose a novel two-step classification pipeline that removes such articles from both model training and the final deployed inference system. Experiments on the ISOT and Combined Corpus datasets show that transformers achieve an increase in F1 score of up to 4.9% for out-of-distribution generalisation compared to baseline approaches, with a further increase of 10.1% following the implementation of our two-step classification pipeline. To the best of our knowledge, this study is the first to investigate the generalisation of transformers in this context.
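The two-step pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the names `two_step_classify`, `filter_training_set` and `PipelineResult` are hypothetical, and the two keyword-based detectors are toy stand-ins for what would, in practice, be trained transformer classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical type: any function mapping article text to a yes/no decision.
TextClassifier = Callable[[str], bool]


@dataclass
class PipelineResult:
    text: str
    is_opinion: bool
    is_fake: Optional[bool]  # None when the article was filtered out in step 1


def two_step_classify(
    articles: Iterable[str],
    is_opinion: TextClassifier,
    is_fake: TextClassifier,
) -> List[PipelineResult]:
    """Step 1 filters opinion articles; step 2 labels the rest as real or fake."""
    results = []
    for text in articles:
        if is_opinion(text):
            # Opinion pieces are set aside and never reach the fake-news model.
            results.append(PipelineResult(text, True, None))
        else:
            results.append(PipelineResult(text, False, is_fake(text)))
    return results


def filter_training_set(
    examples: Iterable[Tuple[str, int]],
    is_opinion: TextClassifier,
) -> List[Tuple[str, int]]:
    """Apply the same opinion filter to labelled training data."""
    return [(text, label) for text, label in examples if not is_opinion(text)]


# Toy stand-ins (assumptions for illustration only, not real detectors).
def toy_opinion_detector(text: str) -> bool:
    lowered = text.lower()
    return "i think" in lowered or "in my view" in lowered


def toy_fake_detector(text: str) -> bool:
    return "shocking" in text.lower()


if __name__ == "__main__":
    sample = [
        "I think the new policy is a disaster.",
        "SHOCKING: aliens endorse candidate.",
        "The council approved the budget on Tuesday.",
    ]
    for result in two_step_classify(sample, toy_opinion_detector, toy_fake_detector):
        print(result)
```

The same opinion filter is applied in both places the abstract mentions: `filter_training_set` removes opinion articles before training, and `two_step_classify` withholds a real/fake label from them at inference time.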