Provably Confidential Language Modelling - Zhuanzhi Papers

May 4, 2022

Provably Confidential Language Modelling

Xuandong Zhao,Lei Li,Yu-Xiang Wang

From arXiv, NAACL 2022

Large language models have been shown to memorize private information, such as social security numbers, from their training data. Given the sheer scale of the training corpus, it is challenging to screen and filter such private data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.
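To make the idea of redaction with a screening policy concrete, here is a minimal sketch of what such a preprocessing step might look like. It is an illustration only, not the authors' implementation: the regular expression, the `MASK_TOKEN` placeholder, and the function names `redact` and `redact_corpus` are all assumptions made for this example. It also covers only the redaction step; the randomized training procedure that underlies the provable guarantee is not shown.

```python
import re

# Hypothetical screening policy: flag spans that look like US social
# security numbers (ddd-dd-dddd). A real policy would be broader and,
# as the abstract notes, only approximately correct.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Assumed placeholder token; not taken from the paper.
MASK_TOKEN = "<MASK>"


def redact(text: str) -> str:
    """Replace every span flagged by the screening policy with MASK_TOKEN."""
    return SSN_PATTERN.sub(MASK_TOKEN, text)


def redact_corpus(lines):
    """Apply redaction to each training line before it reaches the model."""
    return [redact(line) for line in lines]


if __name__ == "__main__":
    corpus = ["My SSN is 123-45-6789.", "No secrets here."]
    for line in redact_corpus(corpus):
        print(line)
```

A pattern-based policy like this will miss some confidential spans and flag some harmless ones, which is why the abstract speaks of an approximately correct policy that amplifies, rather than replaces, the guarantee obtained from randomized training.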
