Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Back-translation, which generates pseudo parallel data from target monolingual data, is a critical component of Unsupervised Neural Machine Translation (UNMT). A UNMT model is trained on pseudo parallel data whose source side is translated text, but at inference it translates natural source sentences. This source-side discrepancy between training and inference hinders the translation performance of UNMT models. Through carefully designed experiments, we identify two representative characteristics of the source-side data gap: (1) a style gap (i.e., translated vs. natural text style) that leads to poor generalization capability; (2) a content gap that induces the model to produce hallucinated content biased towards the target language. To narrow the data gap, we propose an online self-training approach, which simultaneously uses the pseudo parallel data {natural source, translated target} to mimic the inference scenario. Experimental results on several widely-used language pairs show that our approach outperforms two strong baselines (XLM and MASS) by remedying the style and content gaps.
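The two kinds of pseudo parallel data named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_online_batch`, the toy translators, and the choice to return the two sets of pairs separately are all assumptions made for illustration, and the abstract does not say how the two sets are weighted or batched during training.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_online_batch(
    src_mono: List[str],
    tgt_mono: List[str],
    translate_src2tgt: Callable[[str], str],
    translate_tgt2src: Callable[[str], str],
) -> Tuple[List[Pair], List[Pair]]:
    """Build both kinds of pseudo parallel pairs with the current model.

    Hypothetical helper: the translate_* callables stand in for the
    current UNMT model decoding in each direction.
    """
    # Back-translation: {translated source, natural target}.
    bt_pairs = [(translate_tgt2src(y), y) for y in tgt_mono]
    # Self-training: {natural source, translated target}, which matches
    # the inference scenario where the source side is natural text.
    st_pairs = [(x, translate_src2tgt(x)) for x in src_mono]
    return bt_pairs, st_pairs


# Toy stand-ins for a real model (assumption): case conversion only.
def toy_src2tgt(sentence: str) -> str:
    return sentence.upper()


def toy_tgt2src(sentence: str) -> str:
    return sentence.lower()


bt_pairs, st_pairs = build_online_batch(
    src_mono=["guten tag"],
    tgt_mono=["GOOD DAY"],
    translate_src2tgt=toy_src2tgt,
    translate_tgt2src=toy_tgt2src,
)
```

In a real setup, both lists would be regenerated with the latest model at each training step (hence "online"), and the model would be updated on both; the details of that update are left out here because the abstract does not give them.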
