NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis

This paper describes our system for SemEval-2023 Task 12, "Sentiment Analysis for Low-resource African Languages using Twitter Dataset". Sentiment analysis is one of the most widely studied applications in natural language processing, yet most prior work still focuses on a small number of high-resource languages. Building reliable sentiment analysis systems for low-resource languages remains challenging because training data is limited. In this work, we propose to leverage language-adaptive and task-adaptive pretraining on African texts and study transfer learning with source language selection on top of an African language-centric pretrained language model. Our key findings are: (1) Adapting the pretrained model to the target language and task using a small yet relevant corpus improves performance by more than 10 F1 points. (2) Selecting source languages with positive transfer gains during training avoids harmful interference from dissimilar languages, leading to better results in multilingual and cross-lingual settings. In the shared task, our system wins 8 out of 15 tracks and, in particular, performs best in the multilingual evaluation.
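
To make the idea of source language selection concrete, the following is a minimal sketch and not the system's actual implementation. It assumes that a transfer gain is the difference between the target-language F1 obtained when training together with a given source language and the F1 of a target-only baseline. The function name `select_source_languages`, the `min_gain` threshold, and the language labels and scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def select_source_languages(
    baseline_f1: float,
    transfer_f1: Dict[str, float],
    min_gain: float = 0.0,
) -> List[str]:
    """Keep source languages whose transfer gain exceeds min_gain.

    baseline_f1: target-language F1 when training on the target language only.
    transfer_f1: target-language F1 when additionally training on each source.
    Returns the selected sources, ordered from largest to smallest gain.
    """
    # Assumed definition: gain = F1 with the source minus F1 without it.
    gains = {lang: f1 - baseline_f1 for lang, f1 in transfer_f1.items()}
    selected = [lang for lang, gain in gains.items() if gain > min_gain]
    return sorted(selected, key=lambda lang: gains[lang], reverse=True)


# Made-up scores for illustration only; these are not results from the paper.
example_scores = {"src_a": 71.5, "src_b": 68.0, "src_c": 70.2}
chosen = select_source_languages(baseline_f1=70.0, transfer_f1=example_scores)
print(chosen)
```

In this toy setting, `src_b` lowers the target score relative to the baseline and is therefore excluded, which mirrors the goal of avoiding interference from dissimilar languages.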
