简单高效的Bert中文文本分类模型开发和部署

会员服务 ·

简单高效的Bert中文文本分类模型开发和部署

2019 年 6 月 6 日 AINLP

推荐一个Github项目：简单高效的Bert中文文本分类模型开发和部署

BERT-chinese-text-classification-and-deployment

作者是AINLP交流群里的 SunYanCN 同学，项目链接，可点击阅读原文直达：

https://github.com/SunYanCN/BERT-chinese-text-classification-and-deployment

简单高效的Bert中文文本分类模型开发和部署

准备环境工作

操作系统：Linux
TensorFlow Version：1.13.1，动态图模式
GPU：我的服务器是Tesla P4 8G GPU，文档后面有显存不足的解决方案
TensorFlow Serving：simple-tensorflow-serving
依赖库：requirements.txt

目录结构说明

bert是官方源码
data是数据，来自项目，文本的3分类问题
train.sh、classifier.py 训练文件
export.sh、export.py导出TF serving的模型
client.sh、client.py、file_base_client.py 处理输入数据并向部署的TF serving的模型发出请求，打印输出结果

训练代码

和项目基本一致，特殊的地方我会指出。

写一个自己的文本处理器。有两点需要注意：1，改写label 2，把create_examples改成了共有方法，因为我们后面要调用。3，file_base的时候注意跳过第一行，文件数据的第一行是title

class MyProcessor(DataProcessor):

    def get_test_examples(self, data_dir):
        return self.create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_train_examples(self, data_dir):
        """See base class."""
        return self.create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self.create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_pred_examples(self, data_dir):
        return self.create_examples(
            self._read_tsv(os.path.join(data_dir, "pred.tsv")), "pred")

    def get_labels(self):
        """See base class."""
        return ["-1", "0", "1"]

    def create_examples(self, lines, set_type, file_base=True):
        """Creates examples for the training and dev sets. each line is label+\t+text_a+\t+text_b """
        examples = []
        for (i, line) in tqdm(enumerate(lines)):

            if file_base:
                if i == 0:
                    continue

            guid = "%s-%s" % (set_type, i)
            text = tokenization.convert_to_unicode(line[1])
            if set_type == "test" or set_type == "pred":
                label = "0"
            else:
                label = tokenization.convert_to_unicode(line[0])
            examples.append(
                InputExample(guid=guid, text_a=text, label=label))
        return examples

其他的训练代码，照抄官方的就行
可以直接运行train.sh，注意修改对应的路径
生成的ckpt文件在output路径下

导出模型

主要代码如下，生成的pb文件在api文件夹下

def serving_input_receiver_fn():
    input_ids = tf.placeholder(dtype=tf.int64, shape=[None, FLAGS.max_seq_length], name='input_ids')
    input_mask = tf.placeholder(dtype=tf.int64, shape=[None, FLAGS.max_seq_length], name='input_mask')
    segment_ids = tf.placeholder(dtype=tf.int64, shape=[None, FLAGS.max_seq_length], name='segment_ids')
    label_ids = tf.placeholder(dtype=tf.int64, shape=[None, ], name='unique_ids')

    receive_tensors = {'input_ids': input_ids, 'input_mask': input_mask, 'segment_ids': segment_ids,
                       'label_ids': label_ids}
    features = {'input_ids': input_ids, 'input_mask': input_mask, 'segment_ids': segment_ids, "label_ids": label_ids}
    return tf.estimator.export.ServingInputReceiver(features, receive_tensors)

estimator.export_savedmodel(FLAGS.serving_model_save_path, serving_input_receiver_fn)

TensorFlow Serving部署

一键部署：

simple_tensorflow_serving --model_base_path="./api"

正常启动终端界面：

浏览器访问界面：

这部分认真阅读simple-tensorflow-serving的文档

本地请求代码

分为两种，一种是读取文件的，就是要预测的文本是tsv文件的，叫做file_base_client.py，另一个直接输入文本的是client.py。首先更改input_fn_builder，返回dataset，然后从dataset中取数据，转换为list格式，传入模型，返回结果。

正常情况下的运行结果：

问题解答

训练的显存不足怎么办
答：按照官方的建议，调小max_seq_length和train_batch_size

TODO LIST

接入Docker
微信端交互代码

登录查看更多

相关内容

分类模型

关注 0

Transformer文本分类代码

专知会员服务

118+阅读 · 2020年2月3日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【新书】学习TensorFlow2.0，177页pdf，使用Python实现机器学习和深度学习模型

专知会员服务

224+阅读 · 2019年12月28日

【论文推荐】基于BERT修剪的问答模型（Pruning a BERT-based Question Answering Model）

专知会员服务

30+阅读 · 2019年11月22日

【课程】伯克利2019全栈深度学习课程（附下载）

专知会员服务

57+阅读 · 2019年10月29日

【Github】GPT2-Chinese：中文的GPT2训练代码

AINLP

52+阅读 · 2019年8月23日

使用 Bert 预训练模型文本分类（内附源码）

数据库开发

102+阅读 · 2019年3月12日

NLP - 基于 BERT 的中文命名实体识别（NER)

AINLP

466+阅读 · 2019年2月10日

NLP - 15 分钟搭建中文文本分类模型

AINLP

79+阅读 · 2019年1月29日

tensorflow LSTM + CTC实现端到端OCR

机器学习研究会

26+阅读 · 2017年11月16日

DocBERT: BERT for Document Classification

Arxiv

6+阅读 · 2019年8月22日

A BERT Baseline for the Natural Questions

Arxiv

8+阅读 · 2019年3月21日

Multi-task learning to improve natural language understanding

Arxiv

4+阅读 · 2018年12月17日

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Arxiv

15+阅读 · 2018年10月11日

Universal Language Model Fine-tuning for Text Classification

Arxiv

3+阅读 · 2018年5月23日

VIP会员