Back-translation is a critical component of Unsupervised Neural Machine Translation (UNMT), which generates pseudo parallel data from target monolingual data. A UNMT model is trained on this pseudo parallel data, whose source side is translated text, yet it must translate natural source sentences during inference. This source-side discrepancy between training and inference hinders the translation performance of UNMT models. By carefully designing experiments, we identify two representative characteristics of the data gap on the source side: (1) a style gap (i.e., translated vs. natural text style) that leads to poor generalization capability; (2) a content gap that induces the model to produce hallucinated content biased towards the target language. To narrow the data gap, we propose an online self-training approach, which simultaneously uses the pseudo parallel data {natural source, translated target} to mimic the inference scenario. Experimental results on several widely-used language pairs show that our approach outperforms two strong baselines (XLM and MASS) by remedying the style and content gaps.
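Concretely, each training step can combine standard back-translation pairs {translated source, natural target} with self-training pairs {natural source, translated target}, so that the model also sees natural text on the source side. The Python sketch below illustrates this data construction; the helpers `translate` and `train_step` are hypothetical stand-ins for the actual decoding and update routines, so this is an illustration of the idea rather than the paper's implementation.

```python
# Minimal sketch of one online self-training step (illustrative only).
# `translate` and `train_step` are hypothetical placeholders, not the
# paper's actual API.

def translate(model, sentence, direction):
    # Placeholder: decode `sentence` with the current model in the given
    # direction ("tgt2src" or "src2tgt"). Real code would run beam/greedy search.
    return model.decode(sentence, direction)

def train_step(model, sentence_pairs):
    # Placeholder: one gradient update on a batch of (source, target) pairs.
    model.update(sentence_pairs)

def online_self_training_step(model, src_mono_batch, tgt_mono_batch):
    """Build both kinds of pseudo parallel data and update the model.

    Back-translation pair: {translated source, natural target}
    Self-training pair:    {natural source, translated target}
    The second pair exposes the model to natural source sentences,
    mimicking the inference scenario and narrowing the style/content gap.
    """
    # 1) Back-translation: translate natural target -> pseudo source.
    pseudo_src = [translate(model, t, direction="tgt2src") for t in tgt_mono_batch]
    bt_pairs = list(zip(pseudo_src, tgt_mono_batch))

    # 2) Online self-training: translate natural source -> pseudo target.
    pseudo_tgt = [translate(model, s, direction="src2tgt") for s in src_mono_batch]
    st_pairs = list(zip(src_mono_batch, pseudo_tgt))

    # 3) Train simultaneously on both sets of pseudo parallel data.
    train_step(model, bt_pairs + st_pairs)
```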