Currently, the most widespread neural network architecture for training language models is the so-called BERT, which has led to improvements in various Natural Language Processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results obtained on these NLP tasks. Unfortunately, memory consumption and training duration increase drastically with the size of these models. In this article, we investigate various training techniques for smaller BERT models: we combine different methods from other BERT variants, such as ALBERT, RoBERTa, and relative positional encoding. In addition, we propose two new fine-tuning modifications leading to better performance: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields. Furthermore, we introduce Whole-Word Attention, which reduces BERT's memory usage and leads to a small increase in performance compared to classical Multi-Head Attention. We evaluate these techniques on five public German Named Entity Recognition (NER) tasks, two of which are introduced by this article.