Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either for high-resource languages, English in particular, or as multilingual models that trade per-language performance for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, and an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository that explains how to obtain the corpus, how to fine-tune and use the model in production (with practical examples), and how to fully replicate the evaluation process.
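As a minimal sketch of the kind of usage the accompanying repository documents, the model can be loaded through the HuggingFace transformers library. The model identifier below is an assumption based on the paper's public release; if it differs in your setup, substitute the name listed in the repository.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed identifier for the cased variant released with the paper;
# verify against the accompanying repository before relying on it.
MODEL_NAME = "dumitrescustefan/bert-base-romanian-cased-v1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Encode a Romanian sentence and run a forward pass.
inputs = tokenizer("Acesta este un exemplu de propoziție.", return_tensors="pt")
outputs = model(**inputs)

# Contextual embeddings: one 768-dimensional vector per subword token.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```

For downstream tasks such as the ones evaluated in the paper, the same checkpoint would typically be fine-tuned via a task-specific head (e.g., AutoModelForTokenClassification) rather than used as a frozen encoder.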