Misinformation on social media is a widely acknowledged problem, and researchers worldwide are actively working on its detection. However, low-resource languages such as Urdu have received limited attention in this domain. An obvious approach is to take a multilingual pretrained language model and fine-tune it for a downstream classification task such as misinformation detection, but these models struggle with domain-specific terms, leading to suboptimal performance. To address this, we investigate the effectiveness of domain adaptation before fine-tuning for fake news classification in Urdu, employing a staged training approach to improve model generalization. We evaluate two widely used multilingual models, XLM-RoBERTa and mBERT, and apply domain-adaptive pretraining using a publicly available Urdu news corpus. Experiments on four publicly available Urdu fake news datasets show that domain-adapted XLM-RoBERTa consistently outperforms its vanilla counterpart, while domain-adapted mBERT yields mixed results.
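The staged training described above can be summarized as two passes: continued masked-language-model pretraining on raw in-domain text, followed by supervised fine-tuning on a labeled fake news dataset. The following is a minimal sketch of that pipeline using the Hugging Face transformers and datasets libraries; the file names, output directories, and hyperparameters are illustrative placeholders, not the settings reported in the paper.

```python
# Sketch of staged training: (1) domain-adaptive pretraining via MLM on an
# Urdu news corpus, (2) fine-tuning the adapted checkpoint for fake news
# classification. File paths and hyperparameters below are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Stage 1: domain-adaptive pretraining on raw Urdu news text (placeholder file).
corpus = load_dataset("text", data_files={"train": "urdu_news_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)
mlm_model = AutoModelForMaskedLM.from_pretrained(model_name)
mlm_trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="xlmr-urdu-dapt",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("xlmr-urdu-dapt")
tokenizer.save_pretrained("xlmr-urdu-dapt")

# Stage 2: fine-tune the domain-adapted checkpoint on a labeled fake news set
# (placeholder CSV with "text" and "label" columns, labels in {0, 1}).
fake_news = load_dataset("csv", data_files={"train": "urdu_fake_news_train.csv"})["train"]
fake_news = fake_news.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
)
clf_model = AutoModelForSequenceClassification.from_pretrained("xlmr-urdu-dapt", num_labels=2)
clf_trainer = Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="xlmr-urdu-fake-news",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=fake_news,
    data_collator=DataCollatorWithPadding(tokenizer),
)
clf_trainer.train()
```

The same recipe applies to mBERT by swapping the base checkpoint (e.g. "bert-base-multilingual-cased"); the vanilla baselines correspond to running only the second stage directly from the original pretrained weights.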