Multilinguality is becoming ubiquitous: a growing body of work shows that using additional languages helps improve results on many Natural Language Processing (NLP) tasks. Multilingual Multiway Corpora (MMC) contain the same sentence in multiple languages. Such corpora have been used primarily for multi-source and pivot-language machine translation, but they are also useful for developing multilingual sequence taggers through transfer learning. Although these corpora are available, they are not organized for multilingual experiments, and researchers must write boilerplate code every time they want to use them. Moreover, because there is no official MMC collection, it is difficult to compare against existing approaches. We therefore present our work on creating a unified and systematically organized repository of MMC spanning a large number of languages. We also provide training, development, and test splits for corpora where official splits are unavailable. We hope that this will speed up multilingual NLP research and ensure that NLP researchers obtain results that are more trustworthy because they can be compared easily. We indicate corpus sources, extraction procedures (if any), and relevant statistics. We also make our collection public for research purposes.
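To illustrate what a multiway corpus with fixed splits might look like in practice, the following is a minimal sketch and not the repository's actual tooling. It assumes a line-aligned layout, where line i of every language holds the same sentence; the function name `split_multiway`, its parameters, and the in-memory format are hypothetical choices made for this example.

```python
import random
from typing import Dict, List, Tuple

# One multiway example: language code -> the same sentence in that language.
MultiwayExample = Dict[str, str]


def split_multiway(
    corpus: Dict[str, List[str]],
    dev_size: int,
    test_size: int,
    seed: int = 0,
) -> Tuple[List[MultiwayExample], List[MultiwayExample], List[MultiwayExample]]:
    """Split a line-aligned multiway corpus into train/dev/test sets.

    `corpus` maps a language code to its list of sentences. This layout is an
    assumption for illustration: index i must refer to the same sentence in
    every language.
    """
    lengths = {len(lines) for lines in corpus.values()}
    if len(lengths) != 1:
        raise ValueError("every language must have the same number of lines")
    n = lengths.pop()
    if dev_size < 0 or test_size < 0 or dev_size + test_size > n:
        raise ValueError("dev_size + test_size must fit within the corpus")

    # Shuffle indices once with a fixed seed so the split is reproducible
    # and identical across all languages.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    def pick(idxs: List[int]) -> List[MultiwayExample]:
        return [{lang: lines[i] for lang, lines in corpus.items()} for i in idxs]

    dev = pick(indices[:dev_size])
    test = pick(indices[dev_size:dev_size + test_size])
    train = pick(indices[dev_size + test_size:])
    return train, dev, test
```

Because the same shuffled indices are applied to every language, each example stays aligned across languages, and a fixed seed lets different researchers reproduce the same split.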