Toxicity in Multilingual Machine Translation at Scale - Zhuanzhi Papers


Machine translation · Statistics · Toxicity evaluation · Orientation · Low-resource

April 5, 2023

Toxicity in Multilingual Machine Translation at Scale


Marta R. Costa-jussà, Eric Smith, Christophe Ropers, Daniel Licht, Jean Maillard, Javier Ferrando, Carlos Escolano

Machine Translation systems can produce different types of errors, some of which are characterized as critical or catastrophic due to the specific negative impact that they can have on users. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. An automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 translation directions, confirming the prevalence of true added toxicity. We use a measurement of the amount of source contribution to the translation, where a low source contribution implies hallucination, to interpret what causes toxicity. Making use of the input attributions allows us to explain toxicity, because the source contributions significantly correlate with toxicity for 84% of languages studied. Given our findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination and check unstable translations.
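The two quantities the abstract relies on, the rate of added toxicity and the correlation between source contribution and toxicity, can be illustrated with a small sketch. This is not the authors' pipeline: the word-list detector, the placeholder term `badword`, the function names, and the toy source-contribution scores below are all assumptions made for illustration. A translation counts as "added toxicity" here when the detector flags the output but not the English source.

```python
import math
import re


def contains_toxic(text, wordlist):
    """Return True if any word-list entry appears as a whole token in text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return any(word in tokens for word in wordlist)


def added_toxicity_rate(pairs, src_wordlist, tgt_wordlist):
    """Fraction of (source, translation) pairs whose translation is flagged
    as toxic while the source is not."""
    if not pairs:
        return 0.0
    added = sum(
        1
        for src, hyp in pairs
        if contains_toxic(hyp, tgt_wordlist)
        and not contains_toxic(src, src_wordlist)
    )
    return added / len(pairs)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy data: "badword" stands in for an entry of a hypothetical toxicity list.
pairs = [
    ("I am a friend.", "soy un amigo"),      # clean -> clean
    ("I am a parent.", "soy un badword"),    # clean -> toxic (added toxicity)
    ("you badword", "tu badword"),           # toxic -> toxic (not added)
]
rate = added_toxicity_rate(pairs, {"badword"}, {"badword"})

# Hypothetical per-sentence source-contribution scores and toxicity flags.
source_contribution = [0.9, 0.8, 0.3, 0.2]
is_toxic = [0, 0, 1, 1]
r = pearson(source_contribution, is_toxic)
```

With these toy inputs one of three translations adds toxicity, and the made-up scores are negatively correlated with the toxicity flag. A real study would also need a significance test per language, which this sketch leaves out.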

