In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.