We consider the problem of making machine translation more robust to character-level variation on the source side, such as typos. Existing methods achieve greater coverage by applying subword models such as byte-pair encoding (BPE) and character-level encoders, but these methods are highly sensitive to spelling mistakes. We show how training on a mild amount of random synthetic noise can dramatically improve robustness to these variations, without diminishing performance on clean text. We focus on translation performance on natural noise, as captured by frequent corrections in Wikipedia edit logs, and show that robustness to such noise can be achieved using a balanced diet of simple synthetic noises at training time, without access to the natural noise data or distribution.
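As a minimal sketch of what "simple synthetic noises" might look like, the snippet below perturbs source-side words with random character swaps and deletions. The specific noise types, probabilities, and function names here are illustrative assumptions, not the paper's exact recipe.

```python
import random

def swap_noise(word, rng):
    # Swap two adjacent characters at a random position (a common typo model).
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def delete_noise(word, rng):
    # Drop one random character; 1-char words are left untouched.
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def noisy_sentence(sentence, noise_prob=0.1, seed=0):
    # Corrupt each word with probability noise_prob using one randomly
    # chosen noise type; a "balanced diet" would mix several such types.
    rng = random.Random(seed)
    noise_fns = [swap_noise, delete_noise]
    out = []
    for word in sentence.split():
        if rng.random() < noise_prob:
            word = rng.choice(noise_fns)(word, rng)
        out.append(word)
    return " ".join(out)

print(noisy_sentence("the quick brown fox jumps over the lazy dog",
                     noise_prob=0.5))
```

In a training pipeline, such a function would be applied on the fly to source sentences before subword segmentation, so the model sees a fresh noisy variant of each sentence per epoch while targets stay clean.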