Experiments with transfer learning on pre-trained language models such as BERT have shown that the layers of these models resemble the classical NLP pipeline, with progressively more complex tasks being concentrated in later layers of the network. We investigate to what extent these results also hold for a language other than English. For this we probe a Dutch BERT-based model and the multilingual BERT model on Dutch NLP tasks. In addition, by considering the task of part-of-speech tagging in more detail, we show that, even within a given task, information is spread over different parts of the network and that the pipeline might not be as neat as it seems. Each layer has different specialisations, and it is therefore useful to combine information from different layers for best results, instead of selecting a single layer based on the best overall performance.
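The layer-combination idea described above can be sketched with a toy probing experiment. This is an illustrative sketch only: synthetic features stand in for BERT's per-layer hidden states (in practice these would come from a model loaded with `output_hidden_states=True`), and a deliberately simple nearest-centroid probe stands in for the diagnostic classifier. The two synthetic "layers" each encode only part of the label information, mimicking layer specialisation, so a probe on their concatenation outperforms a probe on either layer alone.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 8
labels = rng.integers(0, 4, size=n)          # 4 hypothetical POS classes

# Each synthetic "layer" separates only half of the classes,
# mimicking the specialisation of different BERT layers.
def make_layer(informative_classes):
    feats = rng.normal(size=(n, d))
    for c in informative_classes:            # boost the classes this layer encodes
        feats[labels == c, c] += 3.0
    return feats

layer_a = make_layer([0, 1])                 # "early" layer: classes 0 and 1
layer_b = make_layer([2, 3])                 # "later" layer: classes 2 and 3

def centroid_probe_accuracy(feats):
    # Nearest-class-centroid probe: a deliberately simple diagnostic
    # classifier, in the spirit of probing studies.
    centroids = np.stack([feats[labels == c].mean(0) for c in range(4)])
    pred = np.argmin(((feats[:, None] - centroids) ** 2).sum(-1), axis=1)
    return (pred == labels).mean()

acc_a = centroid_probe_accuracy(layer_a)
acc_b = centroid_probe_accuracy(layer_b)
acc_both = centroid_probe_accuracy(np.hstack([layer_a, layer_b]))
print(f"layer A: {acc_a:.2f}, layer B: {acc_b:.2f}, combined: {acc_both:.2f}")
```

Because each single layer leaves two classes indistinguishable, its probe tops out well below the combined probe, which sees the concatenated features and can separate all four classes.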