We present a simple cross-lingual plagiarism detection method applicable to a large number of languages. The approach leverages open multilingual thesauri for the candidate retrieval task and pre-trained multilingual BERT-based language models for the detailed analysis. Because the method relies on neither machine translation nor word sense disambiguation when in use, it is suitable for a large number of languages, including under-resourced ones. We demonstrate its effectiveness on several existing and new benchmarks, achieving state-of-the-art results for French, Russian, and Armenian.
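The two-stage structure described in the abstract (thesaurus-based candidate retrieval followed by a detailed analysis) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `THESAURUS` mapping, the Jaccard ranking, the function names, and the threshold are hypothetical and are not taken from the paper. In particular, `toy_score` is only a stand-in for the pair scorer, which in the described method would be based on a pre-trained multilingual BERT model.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy thesaurus: (language, word) -> language-independent concept ID.
# A real system would load an open multilingual thesaurus instead.
THESAURUS: Dict[Tuple[str, str], str] = {
    ("en", "cat"): "c1", ("fr", "chat"): "c1",
    ("en", "sleeps"): "c2", ("fr", "dort"): "c2",
    ("en", "dog"): "c3", ("fr", "chien"): "c3",
    ("en", "runs"): "c4", ("fr", "court"): "c4",
}

# Document collection type: doc_id -> (language, text).
Sources = Dict[str, Tuple[str, str]]


def to_concepts(text: str, lang: str) -> Set[str]:
    """Map each known word to its concept ID; unknown words are dropped."""
    words = text.lower().split()
    return {THESAURUS[(lang, w)] for w in words if (lang, w) in THESAURUS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_candidates(query: str, query_lang: str, sources: Sources,
                        top_k: int = 1) -> List[str]:
    """Stage 1: rank source documents by concept overlap with the query."""
    q = to_concepts(query, query_lang)
    scored = [(jaccard(q, to_concepts(text, lang)), doc_id)
              for doc_id, (lang, text) in sources.items()]
    scored.sort(reverse=True)
    # Keep the top_k documents, discarding those with no shared concept.
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]


def detailed_analysis(query: str, candidates: List[str], sources: Sources,
                      score_pair: Callable[[str, str], float],
                      threshold: float = 0.5) -> List[str]:
    """Stage 2: keep candidates whose pair score reaches the threshold."""
    return [d for d in candidates if score_pair(query, sources[d][1]) >= threshold]


def toy_score(query: str, source: str) -> float:
    """Stand-in scorer (assumption); a multilingual BERT model would go here."""
    return jaccard(to_concepts(query, "fr"), to_concepts(source, "en"))


sources: Sources = {
    "doc_cat": ("en", "the cat sleeps"),
    "doc_dog": ("en", "the dog runs"),
}
candidates = retrieve_candidates("le chat dort", "fr", sources, top_k=1)
flagged = detailed_analysis("le chat dort", candidates, sources, toy_score)
print(candidates, flagged)
```

Mapping words to shared concept IDs lets the French query be compared with the English sources without translating either text, which is the property the abstract emphasizes for under-resourced languages.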