Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs. The great majority of past work on word alignment has relied on unsupervised learning over parallel texts. Recently, however, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data. In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs, fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models. We perform experiments on five language pairs and demonstrate that our model can consistently outperform previous state-of-the-art models of all varieties. In addition, we demonstrate that we are able to train multilingual word aligners that achieve robust performance across different language pairs. Our aligner, AWESOME (Aligning Word Embedding Spaces of Multilingual Encoders), with pre-trained models is available at https://github.com/neulab/awesome-align.
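To make the embedding-based alignment extraction described above concrete, here is a minimal sketch: both sentences are encoded with a shared multilingual LM, a word-to-word similarity matrix is computed, and word pairs whose similarities are high after normalizing in both directions are kept. This is not the released awesome-align implementation; the checkpoint name (bert-base-multilingual-cased), the sub-word averaging step, and the threshold value are illustrative assumptions.

```python
# Sketch of word alignment from multilingual LM embeddings (assumptions noted above).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed_words(words):
    """Encode a list of words; average sub-word vectors into one vector per word."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_subwords, dim)
    word_ids = enc.word_ids(0)  # maps each sub-word position to its word index
    vecs = []
    for i in range(len(words)):
        idx = [j for j, w in enumerate(word_ids) if w == i]
        vecs.append(hidden[idx].mean(dim=0))
    return torch.stack(vecs)  # (num_words, dim)

def align(src_words, tgt_words, threshold=1e-3):
    """Return (src_idx, tgt_idx) pairs that score highly in both directions."""
    s, t = embed_words(src_words), embed_words(tgt_words)
    sim = s @ t.T  # raw similarity matrix between source and target words
    # Normalize source-to-target and target-to-source, then keep pairs whose
    # product of normalized scores clears the (assumed) threshold.
    prob = torch.softmax(sim, dim=-1) * torch.softmax(sim, dim=0)
    return [(i, j) for i, j in (prob > threshold).nonzero().tolist()]

print(align("we examine methods".split(), "wir untersuchen Methoden".split()))
```

Normalizing in both directions and intersecting, rather than thresholding the raw similarity matrix, suppresses spurious links from words that are weakly similar to everything; fine-tuning the encoder on parallel text, as the abstract proposes, would sharpen the similarity matrix this extraction step operates on.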