Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs. The great majority of past work on word alignment has relied on unsupervised learning over parallel texts. Recently, however, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data. In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs, fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models. We perform experiments on five language pairs and demonstrate that our model can consistently outperform previous state-of-the-art models of all varieties. In addition, we demonstrate that we are able to train multilingual word aligners that achieve robust performance across different language pairs. Our aligner, AWESOME (Aligning Word Embedding Spaces of Multilingual Encoders), with pre-trained models is available at https://github.com/neulab/awesome-align.
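To make the embedding-based alignment extraction described above concrete, here is a minimal sketch: both sentences are encoded with a shared multilingual LM, a word-to-word similarity matrix is computed, and word pairs whose similarities are high after normalizing in both directions are kept. This is not the released awesome-align implementation; the checkpoint name (bert-base-multilingual-cased), the sub-word averaging step, and the threshold value are illustrative assumptions.

```python
# Sketch of word alignment from multilingual LM embeddings (assumptions noted above).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed_words(words):
    """Encode a list of words; average sub-word vectors into one vector per word."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_subwords, dim)
    word_ids = enc.word_ids(0)  # maps each sub-word position to its word index
    vecs = []
    for i in range(len(words)):
        idx = [j for j, w in enumerate(word_ids) if w == i]
        vecs.append(hidden[idx].mean(dim=0))
    return torch.stack(vecs)  # (num_words, dim)

def align(src_words, tgt_words, threshold=1e-3):
    """Return (src_idx, tgt_idx) pairs that score highly in both directions."""
    s, t = embed_words(src_words), embed_words(tgt_words)
    sim = s @ t.T  # raw similarity matrix between source and target words
    # Normalize source-to-target and target-to-source, then keep pairs whose
    # product of normalized scores clears the (assumed) threshold.
    prob = torch.softmax(sim, dim=-1) * torch.softmax(sim, dim=0)
    return [(i, j) for i, j in (prob > threshold).nonzero().tolist()]

print(align("we examine methods".split(), "wir untersuchen Methoden".split()))
```

Normalizing in both directions and intersecting, rather than thresholding the raw similarity matrix, suppresses spurious links from words that are weakly similar to everything; fine-tuning the encoder on parallel text, as the abstract proposes, would sharpen the similarity matrix this extraction step operates on.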