Most Named Entity Recognition (NER) systems use additional features such as part-of-speech (POS) tags, shallow parsing, and gazetteers. Such features require external resources, such as unlabeled text and trained taggers. Adding them to NER systems has been shown to have a positive impact. However, creating gazetteers or taggers can take a lot of time and may require extensive data cleaning. In this paper, we do not use these traditional features for Chinese NER; instead, we use lexicographic features of Chinese characters. Chinese characters are composed of graphical components called radicals, and these components often carry semantic cues. We propose CNN-based models that incorporate this semantic information and use it for NER. Our models improve over the baseline BERT-BiLSTM-CRF model. We set a new baseline score for Chinese OntoNotes v5.0, with an improvement of +0.64 F1. We present a state-of-the-art F1 score of 71.81 on the Weibo dataset and a competitive improvement of +0.72 over the baseline on the ResumeNER dataset.
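To make the idea of CNN-based radical features concrete, the following is a minimal sketch, not the models proposed in the paper. It assumes a tiny hand-written radical table (`RADICALS`), randomly initialised embeddings and filters, and a single convolution of width 2 followed by ReLU and max-over-time pooling. All names and sizes here (`RADICALS`, `EMB_DIM`, `NUM_FILTERS`, `KERNEL`, `radical_features`) are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical, hand-written decomposition table (illustration only).
RADICALS = {
    "河": ["氵", "可"],
    "海": ["氵", "每"],
    "松": ["木", "公"],
    "林": ["木", "木"],
}

EMB_DIM = 4      # size of each radical embedding (assumed)
NUM_FILTERS = 3  # number of convolution filters (assumed)
KERNEL = 2       # filter width, measured in radicals (assumed)
PAD = "<pad>"    # padding symbol for characters with too few radicals

rng = random.Random(0)  # fixed seed so the sketch is deterministic


def rand_vec(n):
    """Return n pseudo-random weights in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Randomly initialised parameters; in a trained model these would be learned.
vocab = sorted({r for parts in RADICALS.values() for r in parts} | {PAD})
embedding = {r: rand_vec(EMB_DIM) for r in vocab}
filters = [[rand_vec(EMB_DIM) for _ in range(KERNEL)] for _ in range(NUM_FILTERS)]
biases = [0.0] * NUM_FILTERS


def radical_features(char):
    """Convolve over a character's radical embeddings, apply ReLU, max-pool."""
    parts = list(RADICALS.get(char, []))
    # Pad so that at least one convolution window fits.
    while len(parts) < KERNEL:
        parts.append(PAD)
    vecs = [embedding[r] for r in parts]

    features = []
    for filt, bias in zip(filters, biases):
        best = 0.0  # starting at 0 makes the running max equal to max-pooled ReLU
        for start in range(len(vecs) - KERNEL + 1):
            score = bias
            for k in range(KERNEL):
                score += sum(w * x for w, x in zip(filt[k], vecs[start + k]))
            best = max(best, score)
        features.append(best)
    return features


if __name__ == "__main__":
    for ch in "河海松林":
        print(ch, radical_features(ch))
```

In a full tagger, a vector like this could be concatenated with each character's contextual embedding before the sequence-labeling layers. That wiring is an assumption here, not a detail stated in the abstract.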