Chinese word segmentation is necessary to provide word-level information for Chinese named entity recognition (NER) systems. However, the propagation of segmentation errors is a challenge for Chinese NER when processing colloquial data such as social media text. In this paper, we propose a model (UIcwsNN) that specializes in identifying entities in Chinese social media text by leveraging the uncertain information of word segmentation. This uncertain information covers all the potential segmentation states of a sentence and thus gives the model a channel for inferring deep word-level characteristics. We propose a three-stage scheme (candidate position embedding -> position selective attention -> adaptive word convolution) to encode the uncertain word segmentation information and acquire an appropriate word-level representation. Experimental results on a social media corpus show that our model effectively alleviates cascading segmentation errors and achieves a significant performance improvement of more than 2% over previous state-of-the-art methods.
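To make the idea of "all the potential segmentation states of a sentence" concrete, the following is a minimal sketch rather than the UIcwsNN implementation. It assumes a small hypothetical lexicon, a B/M/E/S (begin, middle, end, single) position scheme, and two illustrative helper names, `all_segmentations` and `candidate_positions`, none of which come from the abstract. It enumerates every lexicon-consistent segmentation of a sentence and collects, for each character, the set of positions it could occupy within a word; such per-character candidate sets are the kind of input a candidate position embedding could plausibly encode.

```python
from typing import List, Set


def all_segmentations(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[List[str]]:
    """Enumerate every split of `sentence` into lexicon words or single characters.

    Single characters are always allowed, so at least one segmentation exists.
    `max_len` (an illustrative parameter) caps the candidate word length.
    """
    if not sentence:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, min(max_len, len(sentence)) + 1):
        word = sentence[:end]
        if end == 1 or word in lexicon:
            for rest in all_segmentations(sentence[end:], lexicon, max_len):
                results.append([word] + rest)
    return results


def candidate_positions(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[Set[str]]:
    """For each character, collect the B/M/E/S tags it takes in any segmentation."""
    tags: List[Set[str]] = [set() for _ in sentence]
    for segmentation in all_segmentations(sentence, lexicon, max_len):
        start = 0
        for word in segmentation:
            last = start + len(word) - 1
            if len(word) == 1:
                tags[start].add("S")
            else:
                tags[start].add("B")
                for middle in range(start + 1, last):
                    tags[middle].add("M")
                tags[last].add("E")
            start += len(word)
    return tags


if __name__ == "__main__":
    # Hypothetical toy lexicon for a classic ambiguous sentence.
    toy_lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
    sentence = "南京市长江大桥"
    for char, tag_set in zip(sentence, candidate_positions(sentence, toy_lexicon)):
        print(char, sorted(tag_set))
```

The exhaustive enumeration grows quickly with sentence length and is used here only for clarity; the same per-character tag sets can be obtained directly by scanning for lexicon matches at each position.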