In this work, we present AlphaMWE, a multilingual parallel corpus annotated with multi-word expressions (MWEs), constructed with machine translation (MT) assistance and a human in the loop. The MWEs include verbal MWEs (vMWEs) as defined in the PARSEME shared task, that is, MWEs headed by a verb. The annotated vMWEs are also manually aligned, both bilingually and multilingually. The languages covered are Arabic, Chinese, English, German, Italian, and Polish; the Arabic corpus includes both standard and dialectal varieties from Egypt and Tunisia. Our original English corpus is extracted from the 2018 PARSEME shared task. We machine-translated this source corpus, then had human annotators post-edit the output and annotate the target MWEs. To limit errors, we applied strict quality control: each MT output sentence first received manual post-editing and annotation, followed by a second manual quality check that continued until the annotators reached consensus. One finding from corpus preparation is that accurate translation of MWEs remains challenging for MT systems, as reflected in the results of the human-in-the-loop metric HOPE. To facilitate further MT research, we present a categorisation of the error types that MT systems make when translating MWEs. To obtain a broader view of MT issues, we compared four popular state-of-the-art MT systems: Microsoft Bing Translator, GoogleMT, Baidu Fanyi, and DeepL MT. Because the data has undergone noise removal, translation post-editing, and MWE annotation by human professionals, we believe the AlphaMWE data set will be an asset for both monolingual and cross-lingual research, such as multi-word term lexicography, MT, and information extraction.