Pre-Training with Whole Word Masking for Chinese BERT

Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks. Recently, an upgraded version of BERT was released with Whole Word Masking (WWM), which mitigates the drawbacks of masking partial WordPiece tokens during BERT pre-training. In this technical report, we adapt whole word masking to Chinese text: we mask whole words instead of individual Chinese characters, which could pose a further challenge for the Masked Language Model (MLM) pre-training task. The model was trained on the latest Chinese Wikipedia dump. We aim to provide easy extensibility and better performance for Chinese BERT without changing the neural architecture or even the hyper-parameters. The model is verified on various NLP tasks, ranging from sentence level to document level, including sentiment classification (ChnSentiCorp, Sina Weibo), named entity recognition (People Daily, MSRA-NER), natural language inference (XNLI), sentence pair matching (LCQMC, BQ Corpus), and machine reading comprehension (CMRC 2018, DRCD, CAIL RC). Experimental results on these datasets show that whole word masking can bring a further significant gain. Moreover, we also examine the effectiveness of the Chinese pre-trained models BERT, ERNIE, and BERT-wwm. We release the pre-trained models (both TensorFlow and PyTorch) on GitHub: https://github.com/ymcui/Chinese-BERT-wwm
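
To make the difference between character-level and whole-word masking concrete, here is a minimal Python sketch. It is not the released training code. The function name `whole_word_mask`, the per-word masking probability, and the pre-segmented input are all assumptions made for illustration. Word boundaries are assumed to come from an external Chinese word segmenter. BERT's usual 80/10/10 replacement scheme and its exact masking budget are left out for brevity.

```python
import random

MASK = "[MASK]"  # placeholder token, as used in BERT-style vocabularies


def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask whole words in a pre-segmented Chinese sentence.

    words: list of strings, each one segmented word (hypothetical input
           produced by an external word segmenter).
    Returns (tokens, labels): character-level tokens after masking, and
    the original character at each masked position (None elsewhere).
    """
    if rng is None:
        rng = random.Random()
    tokens, labels = [], []
    for word in words:
        chars = list(word)  # Chinese BERT tokenizes text into characters
        if rng.random() < mask_prob:
            # The masking decision is made once per word, so every
            # character of the word is masked together.
            tokens.extend([MASK] * len(chars))
            labels.extend(chars)
        else:
            tokens.extend(chars)
            labels.extend([None] * len(chars))
    return tokens, labels


if __name__ == "__main__":
    sentence = ["使用", "语言", "模型", "来", "预测"]
    masked, targets = whole_word_mask(sentence, mask_prob=0.3,
                                      rng=random.Random(0))
    print(masked)
    print(targets)
```

Because the decision is drawn per word, a multi-character word such as 模型 is either left intact or replaced entirely by `[MASK]` tokens. Character-level masking, by contrast, could hide 模 while leaving 型 visible.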
