Pre-trained language models such as BERT have exhibited remarkable performance in many natural language understanding (NLU) tasks. The tokens in these models are usually fine-grained: for languages like English they are words or sub-words, and for languages like Chinese they are characters. In English, for example, there are multi-word expressions that form natural lexical units, so coarse-grained tokenization also appears reasonable. In fact, both fine-grained and coarse-grained tokenizations have advantages and disadvantages for learning pre-trained language models. In this paper, we propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT), built on both fine-grained and coarse-grained tokenizations. For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization, employs one encoder to process the sequence of words and another encoder to process the sequence of phrases, shares parameters between the two encoders, and finally creates a sequence of contextualized representations of the words and a sequence of contextualized representations of the phrases. Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE. The results show that AMBERT outperforms the existing best-performing models in almost all cases, and that the improvements are particularly significant for Chinese.
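The two-granularity pipeline described above can be illustrated with a minimal sketch. It is not the AMBERT implementation: the phrase lexicon `PHRASES`, the greedy longest-match segmentation, the toy `SharedEncoder` (a single weight matrix standing in for a Transformer encoder), and the separate randomly initialized embedding table per granularity are all assumptions made for illustration only.

```python
import random

# Hypothetical phrase lexicon (an assumption; the real coarse-grained
# vocabulary is not specified in the abstract).
PHRASES = {("new", "york"), ("ice", "cream")}
MAX_PHRASE_LEN = 2


def fine_tokenize(text):
    """Fine-grained tokens: lowercased, whitespace-separated words."""
    return text.lower().split()


def coarse_tokenize(words):
    """Coarse-grained tokens: greedy longest match against PHRASES."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(MAX_PHRASE_LEN, 1, -1):
            if tuple(words[i:i + n]) in PHRASES:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens


class SharedEncoder:
    """Toy stand-in for an encoder; the same weights serve both inputs."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.weight = [[rng.uniform(-1, 1) for _ in range(dim)]
                       for _ in range(dim)]

    def encode(self, embeddings):
        # Crude "contextualization": average each token with the sequence
        # mean, then apply the shared weight matrix.
        n = len(embeddings)
        mean = [sum(e[j] for e in embeddings) / n for j in range(self.dim)]
        out = []
        for e in embeddings:
            mixed = [(a + b) / 2 for a, b in zip(e, mean)]
            out.append([sum(w * x for w, x in zip(row, mixed))
                        for row in self.weight])
        return out


def embed(tokens, table, dim, rng):
    """Look up, or lazily create, one embedding vector per token."""
    return [table.setdefault(t, [rng.uniform(-1, 1) for _ in range(dim)])
            for t in tokens]


def multi_grained_encode(text, dim=4, seed=0):
    """Return word tokens, phrase tokens, and one vector per token of each."""
    rng = random.Random(seed)
    encoder = SharedEncoder(dim, seed)  # one parameter set for both sequences
    words = fine_tokenize(text)
    phrases = coarse_tokenize(words)
    fine_repr = encoder.encode(embed(words, {}, dim, rng))
    coarse_repr = encoder.encode(embed(phrases, {}, dim, rng))
    return words, phrases, fine_repr, coarse_repr
```

Calling `multi_grained_encode("I ate ice cream in New York")` yields seven word tokens and five phrase tokens, each paired with its own vector. The point of the sketch is only that both sequences pass through the same encoder parameters while keeping separate outputs.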