Demystify Optimization Challenges in Multilingual Transformers - Zhuanzhi Papers


Optimizer · Local curvature · Reducible · Curvature · Transform

April 15, 2021

Demystify Optimization Challenges in Multilingual Transformers


Xian Li, Hongyu Gong

Multilingual Transformers improve parameter efficiency and crosslingual transfer, but how to train multilingual models effectively has not been well studied. Using multilingual machine translation as a testbed, we study optimization challenges from the perspectives of the loss landscape and parameter plasticity. We found that imbalanced training data causes task interference between high- and low-resource languages, characterized by nearly orthogonal gradients for major parameters and an optimization trajectory that is mostly dominated by high-resource languages. We show that the local curvature of the loss surface affects the degree of interference, and that existing data-subsampling heuristics implicitly reduce the sharpness, although they still face a trade-off between high- and low-resource languages. We propose a principled multi-objective optimization algorithm, Curvature Aware Task Scaling (CATS), which improves both optimization and generalization, especially for low-resource languages. Experiments on the TED, WMT and OPUS-100 benchmarks demonstrate that CATS advances the Pareto front of accuracy while being efficient to apply to massive multilingual settings at the scale of 100 languages.
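The interference described in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' code and does not implement CATS; it only assumes two hypothetical per-language gradient vectors (`g_high`, `g_low`) on three shared parameters and hypothetical data proportions (`weights`), then measures how orthogonal the two gradients are and which language the data-weighted update follows.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between two gradients (0.0 if either is zero)."""
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def weighted_gradient(grads, weights):
    """Combine per-language gradients using the given mixing weights."""
    dim = len(grads[0])
    return [sum(w * g[i] for g, w in zip(grads, weights)) for i in range(dim)]


# Hypothetical gradients on three shared parameters (illustrative values only).
g_high = [1.0, 0.0, 2.0]  # high-resource language pair
g_low = [0.0, 3.0, 0.0]   # low-resource language pair

# Hypothetical data proportions: 95% high-resource, 5% low-resource.
weights = [0.95, 0.05]

interference = cosine_similarity(g_high, g_low)
update = weighted_gradient([g_high, g_low], weights)
align_high = cosine_similarity(update, g_high)
align_low = cosine_similarity(update, g_low)

print(f"cos(g_high, g_low)  = {interference:.3f}")
print(f"cos(update, g_high) = {align_high:.3f}")
print(f"cos(update, g_low)  = {align_low:.3f}")
```

With these made-up numbers the two gradients are exactly orthogonal, and the combined update points almost entirely along the high-resource gradient, which is the kind of dominance the abstract attributes to imbalanced data. Real per-language gradients would have to be measured on an actual model.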

