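NLP - Chinese Named Entity Recognition (NER) with BERT

February 10, NLPJob

Author: 艾力亚尔 (Weibo @艾力亚尔), an R&D engineer at 暴风大脑研究院, currently working on the voice assistant for TVs.

Original post (also reachable via "Read the original" at the end of the article):

https://eliyar.biz/nlp_chinese_bert_ner/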




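Sequence labeling is one of the main sentence-level tasks in Chinese natural language processing (NLP): given a text sequence, the model predicts a label for each element of the sequence that needs to be tagged. Common subtasks include named entity recognition (NER), chunking, and part-of-speech (POS) tagging.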
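As a small illustration of what a sequence labeling model outputs, the sketch below tags a sentence character by character with BIO-style labels and groups the tags back into entities. The sentence, its tags, and the helper name `extract_entities` are made up for this example; the PER and LOC labels mirror the entity types reported later in this post, and the exact tag format used by the corpus is an assumption.

```python
from typing import List, Tuple


def extract_entities(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group character-level BIO tags into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_chars: List[str] = []
    current_type = ""
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars = [ch]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_chars and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity
            current_chars.append(ch)
        else:
            # "O" or an inconsistent I- tag closes any open entity
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars = []
            current_type = ""
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities


# Hypothetical example sentence, one tag per character
chars = list("王小明在北京工作")
tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
print(extract_entities(chars, tags))
```

An NER model's job is to predict the `tags` list; turning the tags into entity spans is plain post-processing like the function above.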

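BERT set new records on 11 NLP tasks and has become the new benchmark in the field. Since Google has open-sourced both the model architecture and a pre-trained Chinese model, we will use it to build a sequence labeling model.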


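PS: I recently open-sourced Kashgari (https://github.com/BrikerMan/Kashgari), a minimal framework for text classification and sequence labeling, and this tutorial uses it to build the model. If you want to learn about text classification, see the article below.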

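Environment setup and data preparation

To get ready, set up a Python environment and download the BERT language model.

  • Python 3.6 environment

  • BERT-Base, Chinese model

Install all required dependencies in a virtual environment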


pip install kashgari
pip install tensorflow


Load the data


from kashgari.corpus import ChinaPeoplesDailyNerCorpus

train_x, train_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('train')
validate_x, validate_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('validate')
test_x, test_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('test')

print(f"train data count: {len(train_x)}")
print(f"validate data count: {len(validate_x)}")
print(f"test data count: {len(test_x)}")

Output:

train data count: 20864
validate data count: 2318
test data count: 4636


Create the BERT embedding


from kashgari.embeddings import BERTEmbedding

# The second argument (100) is the sequence length, matching sequence_length in the basic parameters
embedding = BERTEmbedding('<PATH_TO_BERT_FOLDER>', 100)


Build and train the model


from kashgari.tasks.seq_labeling import BLSTMCRFModel

# `BLSTMModel` and `CNNLSTMModel` are also available

model = BLSTMCRFModel(embedding)
# Validation data is the 'validate' split loaded earlier
model.fit(train_x,
          train_y,
          x_validate=validate_x,
          y_validate=validate_y,
          epochs=200,
          batch_size=500)


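Model evaluation results

Model training

Test environment: V100, 2 CPUs, 40G. Thanks to OpenBayes BayesGear compute containers for providing the computing power.
Basic parameters: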

  • batch_size: 2317

  • sequence_length: 100

  • epochs: 200

F1 score


Embedding        CNN-LSTM   B-LSTM   B-LSTM-CRF
bare embedding   0.5275     0.6569   0.6805
Word2vec         0.5042     0.6686   0.7341
BERT             0.8212     0.9043   0.9220

Time per epoch


Embedding        CNN-LSTM   B-LSTM   B-LSTM-CRF
bare embedding   4s         7s       19s
Word2vec         5s         7s       20s
BERT             40s        46s      60s

BERT + B-LSTM-CRF gives the best result. The detailed scores are as follows:


             precision    recall  f1-score   support

       LOC     0.9208    0.9324    0.9266      3431
       ORG     0.8728    0.8882    0.8804      2147
       PER     0.9622    0.9633    0.9627      1797

avg / total     0.9169    0.9271    0.9220      7375
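As a sanity check on the report above, the sketch below recomputes each F1 score as the harmonic mean of precision and recall, and recomputes the overall score as a support-weighted mean. That the avg / total row is a support-weighted average is an assumption, although the numbers are consistent with it; the variable names are chosen for this example only.

```python
# (precision, recall, support) per entity type, copied from the report above
report = {
    "LOC": (0.9208, 0.9324, 3431),
    "ORG": (0.8728, 0.8882, 2147),
    "PER": (0.9622, 0.9633, 1797),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(support for _, _, support in report.values())
per_class_f1 = {name: f1(p, r) for name, (p, r, _) in report.items()}

# Assumption: "avg / total" is the support-weighted mean of the per-class scores
weighted_f1 = sum(
    per_class_f1[name] * support for name, (_, _, support) in report.items()
) / total_support

for name, score in per_class_f1.items():
    print(f"{name}: {score:.4f}")
print(f"weighted F1: {weighted_f1:.4f}, total support: {total_support}")
```

Because the inputs are already rounded to four decimals, the recomputed values can differ from the report in the last digit.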



Model prediction

Prediction environment: MacBook Pro 13, 2 GHz Intel Core i5, 8G RAM

Model initialization time


Embedding        CNN-LSTM   B-LSTM    B-LSTM-CRF
bare embedding   13.535s    9.498s    8.739s
Word2vec         20.042s    14.942s   12.553s
BERT             37.952s    21.986s   24.435s

Predicting 50 sentences in a single batch


Embedding        CNN-LSTM   B-LSTM    B-LSTM-CRF
bare embedding   1.502s     1.395s    0.869s
Word2vec         1.034s     1.901s    0.876s
BERT             36.463s    31.252s   26.601s

Per-sentence prediction time when predicting 50 sentences one by one in a loop


Embedding        CNN-LSTM   B-LSTM   B-LSTM-CRF
bare embedding   0.014s     0.019s   0.035s
Word2vec         0.015s     0.019s   0.052s
BERT             0.606s     0.641s   0.573s
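To put the batch and loop measurements for BERT on the same footing, the short calculation below converts the per-sentence loop times into totals for 50 sentences and the batch totals into per-sentence times. It only rearranges the figures from the two tables above; the variable names are chosen for this example.

```python
N_SENTENCES = 50

# BERT rows from the batch and loop tables above, in seconds
batch_total = {"CNN-LSTM": 36.463, "B-LSTM": 31.252, "B-LSTM-CRF": 26.601}
loop_per_sentence = {"CNN-LSTM": 0.606, "B-LSTM": 0.641, "B-LSTM-CRF": 0.573}

for model_name, batch_seconds in batch_total.items():
    loop_total = loop_per_sentence[model_name] * N_SENTENCES
    batch_per_sentence = batch_seconds / N_SENTENCES
    print(
        f"{model_name}: loop total {loop_total:.2f}s, "
        f"batch total {batch_seconds:.2f}s, "
        f"batch per sentence {batch_per_sentence:.3f}s"
    )
```

With these figures, the two modes land in the same range for BERT, roughly 0.5 to 0.75 seconds per sentence either way, so batching alone does not appear to remove the per-sentence cost on this machine.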

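As the results show, with the same model architecture BERT raises the F1 score substantially, but it also greatly increases training time, model size, and prediction time. For online real-time prediction the performance may not be good enough, so consider workarounds such as caching.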
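One way to picture the caching idea is the minimal sketch below, which memoizes predictions per sentence with the standard library's `functools.lru_cache`. The function `slow_predict` is a hypothetical stand-in for a real model call (it sleeps briefly and tags every character "O"); it is not part of Kashgari, and the cache size is an arbitrary choice.

```python
import time
from functools import lru_cache
from typing import Tuple


def slow_predict(sentence: str) -> Tuple[str, ...]:
    """Hypothetical stand-in for a slow NER model call; tags every character 'O'."""
    time.sleep(0.05)  # simulate model latency
    return tuple("O" for _ in sentence)


@lru_cache(maxsize=10_000)
def cached_predict(sentence: str) -> Tuple[str, ...]:
    """Return cached tags for a sentence seen before; call the model otherwise."""
    return slow_predict(sentence)


start = time.perf_counter()
first = cached_predict("今天北京天气怎么样")
first_seconds = time.perf_counter() - start

start = time.perf_counter()
second = cached_predict("今天北京天气怎么样")  # served from the cache
second_seconds = time.perf_counter() - start

print(f"first call: {first_seconds:.4f}s, repeated call: {second_seconds:.4f}s")
```

The tags are returned as a tuple so that cached results cannot be modified by callers. A cache like this only helps when the same sentences recur, which may be plausible for short voice-assistant queries but is not guaranteed.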


