With the rapid evolution of social media, fake news has become a significant social problem, which cannot be addressed in a timely manner using manual investigation. This has motivated numerous studies on automating fake news detection. Most studies explore supervised training models with different modalities (e.g., text, images, and propagation networks) of news records to identify fake news. However, the performance of such techniques generally drops if news records are coming from different domains (e.g., politics, entertainment), especially for domains that are unseen or rarely-seen during training. As motivation, we empirically show that news records from different domains have significantly different word usage and propagation patterns. Furthermore, due to the sheer volume of unlabelled news records, it is challenging to select news records for manual labelling so that the domain-coverage of the labelled dataset is maximized. Hence, this work: (1) proposes a novel framework that jointly preserves domain-specific and cross-domain knowledge in news records to detect fake news from different domains; and (2) introduces an unsupervised technique to select a set of unlabelled informative news records for manual labelling, which can be ultimately used to train a fake news detection model that performs well for many domains while minimizing the labelling cost. Our experiments show that the integration of the proposed fake news model and the selective annotation approach achieves state-of-the-art performance for cross-domain news datasets, while yielding notable improvements for rarely-appearing domains in news datasets.
翻译:随着社交媒体的迅速演变,假新闻已成为一个重大的社会问题,无法通过人工调查及时解决。这促使了许多关于假新闻探测自动化的研究。大多数研究探索了以不同方式(如文本、图像和传播网络)对新闻记录进行监管的培训模式,以识别假新闻。然而,如果新闻记录来自不同领域(如政治、娱乐),这些技术的绩效一般会下降,特别是对于培训期间不为人知或很少见的领域。作为动机,我们从经验上表明,不同领域的新闻记录有显著不同的文字使用和传播模式。此外,由于大量未贴标签的新闻记录,因此选择手工标签的新闻记录具有挑战性,以便尽可能扩大贴标签数据集的域覆盖。因此,这项工作:(1) 提出一个新框架,共同保存特定领域和交叉在新闻记录中的交叉知识,以探测来自不同领域的假新闻;(2) 引入一种未经反复检验的技术,以选择一组未贴标签的新闻记录用于手工标签的版本。此外,由于大量未贴标签的新闻记录,因此,很难选择用于手工标签的域,从而最大限度地扩大贴标签标签。因此,将标有标签的数据集的域进行模拟检测,同时进行模拟的模拟的模拟的模拟数据模拟测试。