Hate speech detection has become an important research topic over the past decade. More private corporations need to regulate user-generated content on platforms across the globe. In this paper, we present a study of multilingual hate speech classification. We compile a dataset covering 11 languages and reconcile the differing taxonomies of the source datasets by analyzing the combined data with binary labels: hate speech or not hate speech. Because defining hate speech in a single way across languages and datasets may erase cultural nuances in the definition, we use the language-agnostic embeddings provided by LASER and MUSE to develop models that can apply a generalized definition of hate speech across datasets. Furthermore, we evaluate prior state-of-the-art methods for hate speech detection on our expanded dataset. We conduct three types of experiments for the binary hate speech classification task, namely Multilingual-Train Monolingual-Test, Monolingual-Train Monolingual-Test, and Language-Family-Train Monolingual-Test, to see whether performance on each language improves when the model also learns from data in other languages.
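The three experimental scenarios differ only in which languages contribute training data for a given test language. The following is a minimal sketch of that selection step, not the implementation used in this study. The language codes, the family grouping in `FAMILY`, the toy records in `SAMPLES`, and the function name `training_pool` are all hypothetical and chosen for illustration; the actual 11 languages and their grouping are not assumed here.

```python
# Toy records: (language code, text, binary label). All values are placeholders.
SAMPLES = [
    ("en", "example a", 1),
    ("en", "example b", 0),
    ("de", "example c", 1),
    ("es", "example d", 0),
    ("it", "example e", 1),
]

# Hypothetical language-family grouping, for illustration only.
FAMILY = {"en": "germanic", "de": "germanic", "es": "romance", "it": "romance"}


def training_pool(samples, target_lang, scenario, family=FAMILY):
    """Return the training records used when testing on target_lang."""
    if scenario == "monolingual":
        # Monolingual-Train: only the test language itself.
        keep = lambda lang: lang == target_lang
    elif scenario == "language_family":
        # Language-Family-Train: every language in the same family.
        keep = lambda lang: family[lang] == family[target_lang]
    elif scenario == "multilingual":
        # Multilingual-Train: every available language.
        keep = lambda lang: True
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return [record for record in samples if keep(record[0])]


if __name__ == "__main__":
    for scenario in ("monolingual", "language_family", "multilingual"):
        pool = training_pool(SAMPLES, "en", scenario)
        print(scenario, [lang for lang, _, _ in pool])
```

In all three scenarios the test set stays monolingual, so `samples` should contain only training-split records, with the target language's test split held out separately.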