Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either for high-resource languages, English in particular, or as multilingual models that trade per-language performance for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, and an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository that explains how to obtain the corpus, how to fine-tune and use the model in production (with practical examples), and how to fully replicate the evaluation process.
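As a minimal sketch of the kind of usage the accompanying repository documents, the model can be loaded through the HuggingFace transformers library. The model identifier below is an assumption based on the paper's public release; if it differs in your setup, substitute the name listed in the repository.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed identifier for the cased variant released with the paper;
# verify against the accompanying repository before relying on it.
MODEL_NAME = "dumitrescustefan/bert-base-romanian-cased-v1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Encode a Romanian sentence and run a forward pass.
inputs = tokenizer("Acesta este un exemplu de propoziție.", return_tensors="pt")
outputs = model(**inputs)

# Contextual embeddings: one 768-dimensional vector per subword token.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```

For downstream tasks such as the ones evaluated in the paper, the same checkpoint would typically be fine-tuned via a task-specific head (e.g., AutoModelForTokenClassification) rather than used as a frozen encoder.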