Modern Machine Translation (MT) systems perform consistently well on clean, in-domain text. However, most human-generated text, particularly on social media, is full of typos, slang, dialect, idiolect, and other noise that can have a disastrous impact on translation accuracy. In this paper, we leverage the Machine Translation of Noisy Text (MTNT) dataset to enhance the robustness of MT systems by emulating naturally occurring noise in otherwise clean data. By synthesizing noise in this manner, we are ultimately able to make a vanilla MT system more resilient to naturally occurring noise and partially mitigate the resulting loss in accuracy.
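As a rough illustration of what emulating noise in clean data can look like, the sketch below perturbs a clean sentence with two simple noise types: slang substitutions and adjacent-character swaps that mimic typos. It is not the method of this paper, which the abstract does not specify. The `SLANG` table, the function names `swap_adjacent` and `add_noise`, and the probability `p` are all hypothetical choices made for this example, and nothing here is drawn from MTNT.

```python
import random

# Hypothetical slang table for illustration only; not taken from MTNT.
SLANG = {"you": "u", "are": "r", "please": "pls", "because": "bc"}


def swap_adjacent(word, rng):
    """Swap two adjacent interior characters to emulate a typo.

    Words shorter than four characters are returned unchanged so that
    the first and last characters are never moved.
    """
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)  # i and i + 1 are both interior
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def add_noise(sentence, p=0.3, seed=None):
    """Perturb each whitespace-separated token with probability p.

    A selected token is replaced by its slang form if one is listed in
    SLANG; otherwise two adjacent interior characters are swapped.
    """
    rng = random.Random(seed)
    noisy = []
    for token in sentence.split():
        if rng.random() < p:
            lower = token.lower()
            if lower in SLANG:
                token = SLANG[lower]
            else:
                token = swap_adjacent(token, rng)
        noisy.append(token)
    return " ".join(noisy)


if __name__ == "__main__":
    clean = "please tell me because you are late"
    print(add_noise(clean, p=0.5, seed=13))
```

Under these assumptions, a noisy copy of each clean source sentence could be paired with the original target translation to augment the training data. Noise drawn from a real corpus such as MTNT would be far more varied than these two rules.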