Despite the reported success of unsupervised machine translation (MT), the field has yet to examine the conditions under which these methods succeed, and where they fail. We conduct an extensive empirical evaluation of unsupervised MT using dissimilar language pairs, dissimilar domains, diverse datasets, and authentic low-resource languages. We find that performance rapidly deteriorates when source and target corpora are from different domains, and that random word embedding initialization can dramatically affect downstream translation performance. We additionally find that unsupervised MT performance declines when source and target languages use different scripts, and observe very poor performance on authentic low-resource language pairs. We advocate for extensive empirical evaluation of unsupervised MT systems to highlight failure points and encourage continued research on the most promising paradigms.