Unsupervised neural machine translation (UNMT) is especially beneficial for low-resource languages such as those of the Dravidian family. However, UNMT systems tend to fail in realistic scenarios involving actual low-resource languages. Recent work proposes to utilize auxiliary parallel data and has achieved state-of-the-art results. In this work, we focus on unsupervised translation between English and Kannada, a low-resource Dravidian language. We additionally utilize a limited amount of auxiliary parallel data between English and other related Dravidian languages. We show that unifying the writing systems is essential for unsupervised translation between Dravidian languages. We explore several model architectures that use the auxiliary data in order to maximize knowledge sharing and enable UNMT for distant language pairs. Our experiments demonstrate that it is crucial to include auxiliary languages that are similar to our focal language, Kannada. Furthermore, we propose a metric to measure language similarity and show that it serves as a good indicator for selecting the auxiliary languages.