In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.