The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English-to-Gujarati, Gujarati-to-English, English-to-Chinese, Chinese-to-English, German-to-English, and English-to-Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English-Gujarati, we also explored semi-supervised MT with cross-lingual language model pre-training, and translation pivoting through Hindi. For translation to and from Chinese, we investigated character-based tokenisation vs. sub-word segmentation of Chinese text. For German-to-English, we studied the impact of vast amounts of back-translated training data on translation quality, gaining a few additional insights over Edunov et al. (2018). For English-to-Czech, we compared different pre-processing and tokenisation regimes.