Over the last few years, verifying the credibility of information sources has become a fundamental need in the fight against disinformation. Here, we present a language-agnostic model designed to assess the reliability of web domains cited as references across multiple language editions of Wikipedia. Using editing-activity data, the model evaluates domain reliability across articles of varying controversiality, spanning topics such as Climate Change, COVID-19, History, Media, and Biology. From features that capture how domains are used across articles, the model predicts domain reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages it reaches approximately 0.65, while performance on low-resource languages varies. In all cases, the time a domain remains present in articles (which we dub permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. We believe these findings can assist Wikipedia editors in their ongoing efforts to verify citations and may offer useful insights for other user-generated content communities.
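As a rough illustration of the permanence feature mentioned above, the sketch below approximates it as the fraction of an article's observed revision timeline during which a domain appears among the cited references. This is not the paper's published implementation; the function name `domain_permanence` and the `revision_snapshots` input format are assumptions introduced here for clarity.

```python
# Illustrative sketch only; not the exact feature pipeline from the paper.
# "Permanence" is approximated as the share of an article's revision timeline
# during which a given domain appears among its cited domains.
from datetime import datetime
from typing import Iterable, Set, Tuple


def domain_permanence(
    revision_snapshots: Iterable[Tuple[datetime, Set[str]]],
    domain: str,
) -> float:
    """Share of the observed article lifetime in which `domain` is cited.

    `revision_snapshots` is assumed to be a sequence of
    (revision timestamp, set of cited domains) pairs for one article.
    """
    snapshots = sorted(revision_snapshots, key=lambda pair: pair[0])
    if len(snapshots) < 2:
        return 0.0

    total_seconds = (snapshots[-1][0] - snapshots[0][0]).total_seconds()
    if total_seconds == 0:
        return 0.0

    present_seconds = 0.0
    for (t_start, domains), (t_end, _) in zip(snapshots, snapshots[1:]):
        if domain in domains:
            # Count the domain as present until the next observed revision.
            present_seconds += (t_end - t_start).total_seconds()

    return present_seconds / total_seconds
```

A value near 1.0 would indicate a domain that editors kept in the article for nearly its whole observed history, while a value near 0.0 would indicate a citation that was quickly removed.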