Recent unsupervised machine translation (UMT) systems usually employ three main principles: initialization, language modeling, and iterative back-translation, though they may apply them differently. Crucially, iterative back-translation and denoising auto-encoding for language modeling provide the data diversity needed to train UMT systems. However, the gains from these diversification processes have seemed to plateau. We introduce a novel component to the standard UMT framework called Cross-model Back-translated Distillation (CBD), which aims to induce another level of data diversification that the existing principles lack. CBD is applicable to all previous UMT approaches. In our experiments, CBD achieves the state of the art in the WMT'14 English-French, WMT'16 English-German, and English-Romanian bilingual unsupervised translation tasks, with 38.2, 30.1, and 36.3 BLEU respectively. It also yields 1.5-3.3 BLEU improvements in the IWSLT English-French and English-German tasks. Through extensive experimental analyses, we show that CBD is effective because it embraces data diversity while other similar variants do not.
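The cross-model back-translation idea described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the word-level lookup "translators" stand in for two pretrained UMT agents, and the function and variable names are hypothetical.

```python
# Hedged sketch of cross-model back-translated distillation (CBD):
# two independently trained UMT agents translate in opposite directions,
# and their composition over monolingual text produces diverse synthetic
# sentence pairs that a final model can be distilled on.

def make_translator(table):
    """Return a toy word-by-word 'translator' backed by a lookup table.
    This is an illustrative stand-in for a trained UMT agent."""
    def translate(sentence):
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate

# Stand-ins for two UMT agents (assumed pretrained):
# agent 1 translates source -> target, agent 2 translates target -> source.
# Their vocabularies differ slightly, so composing them adds diversity.
agent1_s2t = make_translator({"hello": "bonjour", "world": "monde"})
agent2_t2s = make_translator({"bonjour": "hello", "monde": "earth"})

def cbd_pairs(monolingual_source):
    """Create synthetic (source', target) training pairs by translating
    with one agent and back-translating with the other agent."""
    pairs = []
    for source in monolingual_source:
        target = agent1_s2t(source)        # agent 1: forward translation
        source_bt = agent2_t2s(target)     # agent 2: back-translation
        pairs.append((source_bt, target))  # distill a final model on these
    return pairs

print(cbd_pairs(["hello world"]))  # [('hello earth', 'bonjour monde')]
```

Because the back-translated source comes from a different model than the one that produced the target, the synthetic pairs differ from those a single model's iterative back-translation would generate, which is the extra level of diversification the abstract refers to.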