Cross-model Back-translated Distillation for Unsupervised Machine Translation

Recent unsupervised machine translation (UMT) systems usually employ three main principles: initialization, language modeling, and iterative back-translation, though they may apply them differently. Crucially, iterative back-translation and denoising auto-encoding for language modeling provide data diversity to train the UMT systems. However, the gains from these diversification processes seem to have plateaued. We introduce a novel component to the standard UMT framework called Cross-model Back-translated Distillation (CBD), which aims to induce another level of data diversification that existing principles lack. CBD is applicable to all previous UMT approaches. In our experiments, CBD achieves the state of the art in the WMT'14 English-French, WMT'16 English-German, and WMT'16 English-Romanian bilingual unsupervised translation tasks, with 38.2, 30.1, and 36.3 BLEU respectively. It also yields 1.5-3.3 BLEU improvements in the IWSLT English-French and English-German tasks. Through extensive experimental analyses, we show that CBD is effective because it embraces data diversity while other similar variants do not.

Related content

Machine Translation

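Machine Translation covers all branches of computational linguistics and language engineering, including their multilingual aspects. Featured papers cover theoretical, descriptive, or computational aspects of any of the following topics: the compilation and use of bilingual and multilingual corpora, computer-aided language instruction, the computational implications of non-Roman character sets, connectionist approaches to translation, contrastive linguistics, and others. Official site: http://dblp.uni-trier.de/db/journals/mt/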