Text Mining from Beginner to Expert (17): What If You Only Have a Small Amount of Labeled Text Data?


August 15, 2020, AINLP


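When we use text mining models, we often run into the problem of having too little labeled data. In that situation, even the most advanced classifier struggles to reach good recognition performance.

Now suppose we face the following classification predicament: we have some labeled data, but not much, and far more unlabeled text. Is there a method that makes full use of the unlabeled data to improve the training of the classifier, and ideally labels some of that data "along the way"?

At first glance that sounds far-fetched: is there really an algorithm that smart?

As it turns out, there is!

Time for semi-supervised learning to take the stage!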

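1 What Is a Semi-Supervised Model

Semi-supervised learning (SSL) is a key research topic in pattern recognition and machine learning. It combines supervised and unsupervised learning: it uses a large amount of unlabeled data together with a small amount of labeled data to do pattern recognition. It keeps the human labeling effort low, because a small labeled set guides the prediction of labels for the unlabeled data, which can then be merged into the labeled set, and it can still deliver relatively high accuracy.

So semi-supervised learning is a real blessing for the "hyperparameter tuners" who have models on hand but no labeled data!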

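First, the algorithmic ideas. Two are of main interest here:

Generative models: first estimate the joint distribution of the sample features from all labeled samples, then place the unlabeled samples into that distribution and see how each one should be labeled under it. This process may be iterative.
Birds of a feather: compare labeled and unlabeled samples for similarity; when an unlabeled sample is highly similar to nearby labeled samples, it takes their label. This is likewise an iterative process.

This article focuses on the second idea, and the algorithm involved is label propagation.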

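2 The Label Propagation Algorithm

The Label Propagation Algorithm is a graph-based semi-supervised learning method. The basic idea is to predict the labels of unlabeled nodes from the label information of labeled nodes, using the relationships between samples to build a complete graph model.

Each node's label propagates to its neighbors according to similarity. At every propagation step, each node updates its own label based on the labels of its neighbors: the more similar a neighbor is to the node, the larger that neighbor's weight in determining the node's label, so similar nodes tend toward the same label and labels spread more easily between them. During propagation, the labels of the labeled data are held fixed, so that they pass their labels on to the unlabeled data. When the iteration ends, similar nodes have similar probability distributions and can be assigned to the same class.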
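The propagation loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not scikit-learn's implementation: the function name propagate_labels, the toy chain graph, and the fixed iteration count are all assumptions made for the example, and the sketch assumes every node has at least one neighbor.

```python
import numpy as np

def propagate_labels(W, y, n_classes, n_iter=100):
    """Toy label propagation (hypothetical helper).

    W: (n, n) affinity matrix; y: integer labels, -1 marks unlabeled nodes.
    """
    idx = np.flatnonzero(y != -1)            # positions of labeled nodes
    Y0 = np.zeros((len(y), n_classes))
    Y0[idx, y[idx]] = 1.0                    # one-hot rows for labeled nodes
    T = W / W.sum(axis=1, keepdims=True)     # row-normalized transition matrix
    F = Y0.copy()
    for _ in range(n_iter):
        F = T @ F                            # each node averages its neighbors
        F[idx] = Y0[idx]                     # labeled nodes keep their labels
    return F.argmax(axis=1), F

# Chain graph 0-1-2-3-4-5: node 0 is class 0, node 5 is class 1
W = np.zeros((6, 6))
for k in range(5):
    W[k, k + 1] = W[k + 1, k] = 1.0
y = np.array([0, -1, -1, -1, -1, 1])

pred, F = propagate_labels(W, y, n_classes=2)
print(pred)
```

On this chain, nodes 1 and 2 end up with the label of node 0 and nodes 3 and 4 with the label of node 5, since each unlabeled node sits closer to one labeled end than to the other.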

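Label propagation denotes a few variants of semi-supervised graph inference algorithms, and it has the following 2 features:

Usable for classification tasks (the scikit-learn estimators discussed below are classifiers)
Kernel methods that project the data into alternate dimensional spaces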

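It is worth noting that the semi-supervised estimators in sklearn.semi_supervised can use this additional unlabeled data to better capture the shape of the underlying data distribution and to generalize better to new samples. When we have a very small amount of labeled data and a large amount of unlabeled data, semi-supervised algorithms can perform well.

scikit-learn provides two label propagation models, LabelPropagation and LabelSpreading. Both work by constructing a similarity graph over all items in the input dataset.
LabelPropagation and LabelSpreading have things in common, but there are also clear differences: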

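LabelPropagation and LabelSpreading differ in how they modify the graph's similarity matrix and in the clamping effect on the label distributions. Clamping lets the algorithm change the weight of the true labeled data to some degree. LabelPropagation performs hard clamping of the input labels, which means alpha = 0. This clamping factor can be relaxed, say to alpha = 0.2, which means we always retain 80% of the original label distribution, while the algorithm is allowed to change its confidence in that distribution within 20%.
LabelPropagation uses the raw similarity matrix constructed from the data, without any modification. In contrast, LabelSpreading minimizes a loss function with regularization properties, so it is often more robust to noise (take note!). The algorithm iterates on a modified version of the original graph and normalizes the edge weights by computing the normalized graph Laplacian matrix. The same procedure is also used in spectral clustering.
LabelSpreading is similar to the basic label propagation algorithm, but it propagates labels using an affinity matrix based on the normalized graph Laplacian, with soft clamping.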
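To make soft clamping concrete, here is a rough NumPy sketch of a LabelSpreading-style update, F = alpha * S @ F + (1 - alpha) * Y0, where S is the symmetrically normalized affinity matrix. The helper spread_labels and the five-node toy graph are assumptions for illustration, not code from scikit-learn. Node 3 carries a deliberately "noisy" label that disagrees with its neighbors, and node 4 is unlabeled.

```python
import numpy as np

def spread_labels(W, y, n_classes, alpha, n_iter=200):
    """Soft-clamped propagation (hypothetical helper); y uses -1 for unlabeled."""
    idx = np.flatnonzero(y != -1)
    Y0 = np.zeros((len(y), n_classes))
    Y0[idx, y[idx]] = 1.0                    # initial one-hot label matrix
    d = W.sum(axis=1)                        # node degrees
    S = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}
    F = Y0.copy()
    for _ in range(n_iter):
        # Mix neighbor information (weight alpha) with the initial labels
        F = alpha * (S @ F) + (1 - alpha) * Y0
    return F.argmax(axis=1)

# Fully connected graph of 5 nodes; node 3 has a label that conflicts
# with nodes 0-2, node 4 is unlabeled.
W = np.ones((5, 5)) - np.eye(5)
y = np.array([0, 0, 0, 1, -1])

pred_low_alpha = spread_labels(W, y, n_classes=2, alpha=0.2)
pred_high_alpha = spread_labels(W, y, n_classes=2, alpha=0.9)
print(pred_low_alpha, pred_high_alpha)
```

With a small alpha the suspicious label on node 3 survives, because most of its weight comes from the original label. With a large alpha its neighbors outvote it and node 3 is reassigned to class 0. This is the sense in which soft clamping can tolerate some label noise.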

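The label propagation models have two built-in kernel methods. The choice of kernel affects both the scalability and the performance of the algorithm. The following 2 options are available:

rbf: the weight between two points approaches 1 as they get closer and 0 as they get farther apart, i.e. exp(-gamma * |x - y|^2); controlled by the keyword gamma.
knn: each sample is connected to its k nearest neighbors (edge weight 1 for a neighbor, 0 otherwise). Intuitively, an unlabeled point takes the label that is most common among the labeled points in its neighborhood, newly labeled points then pass their labels on in turn, and propagation continues until the unlabeled points have all been labeled; controlled by the keyword n_neighbors.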

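The rbf kernel produces a fully connected graph, represented in memory by a dense matrix. This matrix can be very large, and combined with the cost of a full matrix multiplication at every iteration of the algorithm, it can lead to very long running times. The knn kernel, on the other hand, produces a much more memory-friendly sparse matrix, which can drastically reduce running time.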
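The memory difference is easy to see on random data. The sketch below builds both kinds of graph with rbf_kernel and kneighbors_graph from scikit-learn. The data, gamma=20, and n_neighbors=7 are arbitrary choices for illustration, and this is not necessarily how the estimators construct their graphs internally.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

rng_demo = np.random.RandomState(0)
X_demo = rng_demo.rand(200, 5)               # 200 random points in 5 dimensions

# RBF affinities: a dense (200, 200) array, every pair gets a weight
W_rbf = rbf_kernel(X_demo, gamma=20)

# kNN connectivity: a sparse matrix with n_neighbors entries per row
W_knn = kneighbors_graph(X_demo, n_neighbors=7,
                         mode="connectivity", include_self=True)

rbf_nonzero = int(np.count_nonzero(W_rbf))
knn_nonzero = int(W_knn.nnz)
print("stored weights, rbf:", rbf_nonzero, "knn:", knn_nonzero)
```

The dense RBF matrix holds a weight for all 200 x 200 pairs, while the kNN graph stores only 7 entries per row, and the gap widens quadratically as the number of samples grows.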

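Because LabelSpreading is more robust to noise, I use it in the example below to perform semi-supervised learning based on label propagation on a sample set that contains a small amount of labeled data and a large amount of unlabeled data.

For convenience, the demo data comes from the 20newsgroups dataset that ships with sklearn. In my tests so far, this semi-supervised approach works remarkably well on long-text classification. For short texts, you would need to extract features with today's state-of-the-art transformer-based pretrained models.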

3 Hands-On Example: Learning from Few Labeled Samples with LabelSpreading in sklearn

First, import the necessary libraries.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import fetch_20newsgroups
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

For this test we only need 6 of the categories, and we use both the training and test splits, i.e. the parameter "subset" is set to "all".

categories = [
    'rec.autos',
    'talk.politics.guns',
    'talk.politics.mideast',
    'rec.sport.baseball',
    'comp.sys.mac.hardware',
    'soc.religion.christian',
]
newsgroup_train = fetch_20newsgroups(subset='all', categories=categories)

Shuffle the dataset by generating a random permutation of its indices.

rng = np.random.RandomState(0)
indices = np.arange(len(newsgroup_train.target))
rng.shuffle(indices)

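Extract features from the text. We use TF-IDF features over 1-grams and 2-grams, remove stop words, exclude terms that appear in more than 65% of the documents, and cap the number of features at 15000 (the maximum vocabulary size).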

vectorizer = TfidfVectorizer(stop_words='english',
                             max_df=0.65,
                             ngram_range=(1, 2),
                             max_features=15000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
y_train = newsgroup_train.target
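To get a feel for what these vectorizer options do, here is a small sketch on a made-up four-sentence corpus. The sentences and the names toy_docs, toy_vectorizer, and vocab are assumptions for illustration. It uses the same stop_words, max_df, and ngram_range settings, and leaves out max_features, which makes no difference at this size.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "the engine of my car stalls",
    "car engine repair tips",
    "the pitcher threw a fastball",
    "baseball pitcher stats",
]

toy_vectorizer = TfidfVectorizer(stop_words="english",
                                 max_df=0.65,
                                 ngram_range=(1, 2))
toy_features = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept term (unigram or bigram) to its column index
vocab = toy_vectorizer.vocabulary_
print(toy_features.shape)
print("car engine" in vocab, "the" in vocab)
```

Stop words such as "the" never reach the vocabulary, while bigrams such as "car engine" become features of their own, which is what lets the model pick up short phrases in addition to single words.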

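We first train a label propagation model using only 300 labeled points, then select the 10 most uncertain points and label them. Next we train with 310 labeled points (the original 300 plus the 10 new ones). Repeating this process about 20 times yields a sizable amount of labeled data.

Of course, you can label more points by changing max_iterations. Labeling more points helps us understand how quickly this active-learning technique converges.

Note: when training the model with the fit method, the unlabeled points must be assigned an identifier alongside the labeled data. The identifier used in this example is the integer -1.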
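A tiny sketch of that convention on made-up one-dimensional data (X_toy, y_toy, and the parameter values are assumptions for illustration): only one point per cluster has a real label, and everything else is marked -1 before calling fit.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Two well-separated clusters on a line
X_toy = np.array([[0.0], [0.1], [0.2], [0.3],
                  [5.0], [5.1], [5.2], [5.3]])

# One known label per cluster; -1 marks the unlabeled points
y_toy = np.array([0, -1, -1, -1, 1, -1, -1, -1])

toy_model = LabelSpreading(kernel="knn", n_neighbors=3, alpha=0.2)
toy_model.fit(X_toy, y_toy)

# transduction_ holds the label assigned to every training point
print(toy_model.transduction_)
```

After fitting, transduction_ contains a label for every input point, including those that were passed in as -1.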

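test_num = 2000
X = fea_train[indices[:test_num]]
y = y_train[indices[:test_num]]
# raw texts of the selected samples (the name "images" is kept from the original code)
images = np.array(newsgroup_train.data)[indices[:test_num]]
n_total_samples = len(y)
n_labeled_points = 300
max_iterations = 20
# everything after the first 300 shuffled samples is treated as unlabeled
unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]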

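Take a look at the indices of the unlabeled data. Note that these are positions in the shuffled data, so the documents they point to are in random order.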

unlabeled_indices

array([ 300,  301,  302, ..., 1997, 1998, 1999])

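In each iteration below, the program uses information entropy to show the TOP 10 texts that the model is least sure about. These predictions may or may not be wrong. They are the samples that most need labeling in a corpus-annotation effort: they matter a great deal to the classifier, and sometimes we can also find labeling errors among these uncertain pre-labeled samples. Here, the uncertain examples are given their true labels and fed into the next round of model training.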
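The selection step on its own is short. Below is a sketch on a hand-written matrix of label distributions (the numbers and the names label_distributions, unlabeled, and top2 are made up for illustration): compute the entropy of each row, sort from most to least uncertain, and keep only the points that are still unlabeled.

```python
import numpy as np
from scipy.stats import entropy

# One row per sample, one column per class (each row sums to 1)
label_distributions = np.array([
    [0.98, 0.01, 0.01],   # very confident
    [0.34, 0.33, 0.33],   # almost uniform: highly uncertain
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
])
unlabeled = np.array([1, 2, 3])       # sample 0 already has a label

# entropy() works along axis 0, so transpose to get one value per sample
ent = entropy(label_distributions.T)

ranked = np.argsort(ent)[::-1]                   # most uncertain first
ranked = ranked[np.isin(ranked, unlabeled)]      # drop already-labeled points
top2 = ranked[:2]
print(top2)
```

The nearly uniform row ranks first, because entropy is highest when the probability mass is spread evenly across classes.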

for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        print("No unlabeled candidates left to label")
        break

    # hide the labels of the unlabeled points by marking them with -1
    y_train = np.copy(y)
    y_train[unlabeled_indices] = -1

    lp_model = LabelSpreading(
        gamma=0.25,        # only used by the rbf kernel; has no effect with kernel='knn'
        kernel='knn',
        alpha=0.5,
        n_neighbors=15,
        max_iter=50,
        n_jobs=-1,
    )
    lp_model.fit(X.toarray(), y_train)

    predicted_labels = lp_model.transduction_[unlabeled_indices]
    true_labels = y[unlabeled_indices]

    cm = confusion_matrix(true_labels, predicted_labels,
                          labels=lp_model.classes_)

    print("[Iteration] %i %s" % (i, 70 * "_"))
    print("LabelSpreading model: %d labeled & %d unlabeled (%d total)"
          % (n_labeled_points, n_total_samples - n_labeled_points,
             n_total_samples))

    # Row names must follow the integer label order. fetch_20newsgroups orders
    # target_names to match the labels, which need not be the order in which
    # the categories were passed, so take the names from the dataset itself.
    print(classification_report(
        true_labels,
        predicted_labels,
        labels=lp_model.classes_,
        target_names=newsgroup_train.target_names))

    print("[Confusion matrix]")
    print(cm)

    # compute the entropies of transduced label distributions
    pred_entropies = stats.entropy(lp_model.label_distributions_.T)

    # select up to 10 text examples that the classifier is most uncertain about
    uncertainty_index = np.argsort(pred_entropies)[::-1]
    uncertainty_index = uncertainty_index[
        np.isin(uncertainty_index, unlabeled_indices)][:10]

    # keep track of indices that we get labels for
    delete_indices = np.array([], dtype=int)

    for index, image_index in enumerate(uncertainty_index):
        image = images[image_index]
        print('[Most uncertain sample]\n', image)
        print('……………' * 5)
        print("Predicted label: {}\nTrue label: {}".format(
            newsgroup_train.target_names[lp_model.transduction_[image_index]],
            newsgroup_train.target_names[y[image_index]]))
        print('******************' * 5)

        # labeling 10 points, remote from labeled set
        delete_index, = np.where(unlabeled_indices == image_index)
        delete_indices = np.concatenate((delete_indices, delete_index))

    # the selected points now count as labeled (their true labels are revealed)
    unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
    n_labeled_points += len(uncertainty_index)
    print('========= End of round {} ============'.format(i))

[Iteration] 0 ______________________________________________________________________
LabelSpreading model: 300 labeled & 1700 unlabeled (2000 total)
                        precision    recall  f1-score   support

             rec.autos       0.73      0.73      0.73       277
    talk.politics.guns       0.65      0.81      0.72       272
 talk.politics.mideast       0.75      0.76      0.75       273
    rec.sport.baseball       0.91      0.80      0.85       323
 comp.sys.mac.hardware       0.79      0.79      0.79       268
soc.religion.christian       0.93      0.83      0.87       287

              accuracy                           0.79      1700
             macro avg       0.79      0.79      0.79      1700
          weighted avg       0.80      0.79      0.79      1700

[Confusion matrix]
[[203  37  15   5  14   3]
 [ 12 221  20   8  11   0]
 [ 20  26 207   7  10   3]
 [ 16  22   9 257  11   8]
 [ 14  18  17   4 211   4]
 [ 14  15   9   1  11 237]]
[Most uncertain sample]
 From: yoony@aix.rpi.edu (Young-Hoon Yoon)
Subject: Re: JFFO has gone a bit too far
Nntp-Posting-Host: aix.rpi.edu
Distribution: usa
Lines: 29

rats@cbnewsc.cb.att.com (Morris the Cat) writes:


>|>Would somebody please post evidence that the gun control act of
>|>1968 is "a verbatim transcription" of a nazi law?

>|The "evidence" is that the two laws are basically identical.
>|However, that's not evidence that one is a copy of the other.

>|There's no evidence that the 68 GCA's authors used the nazi law as a
>|guide.  Yes, they ended up with roughly the same thing, but that comes
>|from their shared goal, disarming those menacing minorities.

>I thought the same thing too, until JPFO's RKBA article
>in the latest Guns & Ammo
>at the newstands. This article makes it certain that Sen. Thomas Dodd
>(D-MD?) back before 1968 definitely asked for a translation of the
>German weapons laws back then. Read the article, and see what you think
>of JPFO's argument. They note that Ted Kennedy and John Dingell are
>among the three of the originals left from the 1968 stuff, and they
>are asking that folks request of John Dingell that he introduce
>legislation to lift GCA '68, something which I would support whole-
>heartedly!

>|-andy

Can someone post a general idea of what GCA '68 does?
Thanks.


…………………………………………………………………
Predicted label: comp.sys.mac.hardware
True label: talk.politics.guns
******************************************************************************************
[Most uncertain sample]
 From: lau@aerospace.aero.org (David Lau)
Subject: Re: Accelerating the MacPlus...;)
Nntp-Posting-Host: michigan.aero.org
Organization: The Aerospace Corporation; El Segundo, CA
Lines: 17

  Also, if someone would recommend another
> accelerator for the MacPlus, I'd like to hear about it.
>
> Thanks for any time and effort you expend on this!
>
> Karl

Try looking at the Brainstorm Accelerator for the Plus.  I believe it is
the best solution because of the performance and price.  Why spend $800
upgrading a computer that is only worth $300 ????
  The brainstorm accelerator is around $225.  It speeds up the internal
clock speed to 16MHz.  That may not seem like much but it also speeds up
SCSI transfers.  I think that feature is unique to brainstorm.
Check it out.

David Lau
lau@aerospace.aero.org

…………………………………………………………………
Predicted label: comp.sys.mac.hardware
True label: comp.sys.mac.hardware
******************************************************************************************
[Most uncertain sample]
 From: C604223@mizzou1.missouri.edu (Cho Chuen Wong)
Subject: Performa Plus monitor
Nntp-Posting-Host: mizzou1.missouri.edu
Organization: University of Missouri
Lines: 3

I would like to know if a Performa Plus monitor is compatible with Apple 14in
Color Display, or it is just a VGA moniro.  Any help will be appreciate.
 

…………………………………………………………………
Predicted label: comp.sys.mac.hardware
True label: comp.sys.mac.hardware
******************************************************************************************
[Most uncertain sample]
 From: murthy@ssdsun.asl.dl.nec.com (Vasudev Murthy)
Subject: Re: Saudi clergy condemns debut of human rights group!
Keywords: international, non-usa government, government, civil rights, 	social issues, politics
Nntp-Posting-Host: ssdsun
Organization: NEC America, Inc Irving TX
Lines: 21

In article <39898@optima.cs.arizona.edu> bakken@cs.arizona.edu (Dave Bakken) writes:
[deleted]
>
>Is this really what you (and Rached and others in the general
>west-is-evil-zionists-rule-hate-west-or-you-are-a-puppet crowd)
>want, Ilyess?

It's noteworthy that the posts about the west being
evil etc are made not in some Islamic hellhole but from
the west. If the west is so bad, why do they come here?
Notice how they comfortably exercise their rights to
free expression, something completely absent in their
own countries.

Vasudev

...

========= End of round 11 ============
[Iteration] 12 ______________________________________________________________________
LabelSpreading model: 420 labeled & 1580 unlabeled (2000 total)
                        precision    recall  f1-score   support

             rec.autos       0.79      0.80      0.80       256
    talk.politics.guns       0.73      0.86      0.79       251
 talk.politics.mideast       0.77      0.81      0.79       250
    rec.sport.baseball       0.96      0.79      0.87       306
 comp.sys.mac.hardware       0.80      0.83      0.81       247
soc.religion.christian       0.91      0.87      0.89       270

              accuracy                           0.83      1580
             macro avg       0.83      0.83      0.83      1580
          weighted avg       0.83      0.83      0.83      1580

[Confusion matrix]
[[204  24  11   1  13   3]
 [  9 217  11   2  11   1]
 [ 12  22 203   2   8   3]
 [ 15  14  18 241  11   7]
 [  9   9  15   1 205   8]
 [  8  10   5   3   9 235]]

...

[Iteration] 19 ______________________________________________________________________
LabelSpreading model: 890 labeled & 1110 unlabeled (2000 total)
                        precision    recall  f1-score   support

             rec.autos       0.93      0.96      0.95       159
    talk.politics.guns       0.92      0.98      0.95       155
 talk.politics.mideast       0.90      0.95      0.92       157
    rec.sport.baseball       0.99      0.93      0.96       241
 comp.sys.mac.hardware       0.97      0.96      0.96       177
soc.religion.christian       0.99      0.95      0.97       221

              accuracy                           0.95      1110
             macro avg       0.95      0.96      0.95      1110
          weighted avg       0.95      0.95      0.95      1110

[Confusion matrix]
[[153   3   2   0   1   0]
 [  1 152   1   0   1   0]
 [  1   5 149   0   2   0]
 [  6   1   6 225   2   1]
 [  0   2   4   0 170   1]
 [  3   3   4   2   0 209]]

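As we can see, after 20 rounds of training we obtained a sizable amount of labeled data, and the model's accuracy kept improving along the way. The examples the model is unsure about deserve careful study to find the underlying problem: were they mislabeled? Are the classes genuinely too similar? Or is there something wrong with our classification scheme itself?

For short-text classification, the approach above needs some adapting because of semantic sparsity. For feature extraction, try currently popular models such as BERT, RoBERTa, and XLNet. If you have tried this and gotten good results, remember to share them with me (折耳喵)!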
