Self-tuning hyper-parameters for unsupervised cross-lingual tokenization - 专知论文

Tokenization · Unsupervised · Compression factor · Supervised · Hyper-parameters

April 4, 2023

Self-tuning hyper-parameters for unsupervised cross-lingual tokenization

From arXiv; 5 figures, 2 tables; submitted to the KONT-2023 conference

We explore the possibility of meta-learning for the language-independent unsupervised tokenization problem for English, Russian, and Chinese. We implement the meta-learning approach for automatic determination of the hyper-parameters of the unsupervised tokenization model proposed in earlier works, relying on various human-independent fitness functions such as normalised anti-entropy, compression factor and cross-split F1 score, as well as additive and multiplicative composite combinations of the three metrics, testing them against the conventional F1 tokenization score. We find a fairly good correlation between the latter and the additive combination of the former three metrics for English and Russian. In the case of Chinese, we find a significant correlation between the F1 score and the compression factor. Our results suggest the possibility of robust unsupervised tokenization of low-resource and dead languages and allow us to think about human languages in terms of the evolution of efficient symbolic communication codes with different structural optimisation schemes that have evolved in different human cultures.
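
The fitness functions named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so every definition below is an assumption chosen for illustration rather than the paper's own: anti-entropy is taken as one minus the Shannon entropy of token frequencies divided by its maximum, the compression factor as the relative saving of a token-index-plus-dictionary encoding over the raw character count, and F1 as agreement on internal token boundaries. The function names (`boundary_f1`, `anti_entropy`, `compression_factor`, `composite_fitness`, `select_best`) are hypothetical.

```python
import math
from collections import Counter


def boundaries(tokens):
    """Character offsets of the internal token boundaries."""
    offsets, position = set(), 0
    for token in tokens[:-1]:
        position += len(token)
        offsets.add(position)
    return offsets


def boundary_f1(predicted, reference):
    """F1 over internal boundaries of two tokenizations of the same text (assumed definition)."""
    pred, ref = boundaries(predicted), boundaries(reference)
    if not pred and not ref:
        return 1.0  # both tokenizations keep the text as a single token
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(ref)
    return 2 * precision * recall / (precision + recall)


def anti_entropy(tokens):
    """1 - H / H_max over token frequencies (assumed normalisation)."""
    counts = Counter(tokens)
    if len(counts) < 2:
        return 1.0
    total = len(tokens)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return 1.0 - entropy / math.log2(len(counts))


def compression_factor(tokens):
    """Relative saving of (token count + dictionary characters) over raw characters (assumed)."""
    raw = sum(len(t) for t in tokens)
    if raw == 0:
        return 0.0
    encoded = len(tokens) + sum(len(t) for t in set(tokens))
    return 1.0 - encoded / raw


def composite_fitness(tokens, other_split_tokens):
    """Additive and multiplicative combinations of the three metrics.

    The cross-split F1 is approximated here by comparing the tokenization with
    one produced by a model trained on a different split of the corpus.
    """
    s = anti_entropy(tokens)
    c = compression_factor(tokens)
    f = boundary_f1(tokens, other_split_tokens)
    return {"additive": s + c + f, "multiplicative": s * c * f}


def select_best(candidates, key="additive"):
    """Pick the hyper-parameter label whose tokenization maximises the chosen fitness.

    `candidates` maps a label to a (tokens, other_split_tokens) pair.
    """
    return max(candidates, key=lambda name: composite_fitness(*candidates[name])[key])
```

In this sketch, the meta-learning step reduces to calling `select_best` over the tokenizations produced by each hyper-parameter setting; no reference tokenization is needed, and the conventional F1 against a gold standard would be used only afterwards to check how well the unsupervised fitness tracks it.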
