In natural language processing, consider the task of deciding whether two words form a context-target pair: if they do, the pair is a positive example; otherwise it is a negative example. A positive example is generated by sampling a context word together with its target word; a negative example is then generated by keeping the same context word and pairing it with a word chosen at random from the vocabulary. This procedure is negative sampling.
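A minimal sketch of this sampling scheme is shown below; the toy vocabulary, the `make_training_pairs` helper, and the `num_negatives` parameter are illustrative assumptions, not part of any particular library.

```python
import random

# Hypothetical toy vocabulary for illustration.
vocab = ["orange", "juice", "king", "book", "the", "of"]

def make_training_pairs(context, target, num_negatives=4):
    """Return one positive (context, target, 1) pair plus several negative pairs,
    built by keeping the same context word and drawing random words from the vocabulary."""
    pairs = [(context, target, 1)]                 # positive example
    for _ in range(num_negatives):
        noise_word = random.choice(vocab)          # random word from the dictionary
        pairs.append((context, noise_word, 0))     # negative example
    return pairs

print(make_training_pairs("orange", "juice"))
```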


Contrastive learning trains an encoder to distinguish given positive examples from negative examples, thereby capturing the most discriminative features of the data. Because this self-supervised approach combines a clean, easy-to-understand framework with remarkably strong results, it has attracted a great deal of attention in top-conference papers. This post shares two papers from KDD 2020: one applies contrastive learning to graph pre-training; the other gives an in-depth analysis of the role of negative sampling in graph representation learning, which can inform further developments in contrastive learning.
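As a concrete reference point, here is a generic InfoNCE-style contrastive loss in PyTorch. It is a sketch of the general idea only, not the exact objective used in either paper, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE-style contrastive loss (sketch).
    query:     (B, D) encoded anchors
    positive:  (B, D) encoded positives, one per anchor
    negatives: (B, K, D) encoded negatives, K per anchor
    """
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logits = (query * positive).sum(dim=-1, keepdim=True)   # (B, 1) similarity to the positive
    neg_logits = torch.einsum("bd,bkd->bk", query, negatives)   # (B, K) similarities to the negatives
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```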

GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training

论文地址:https://arxiv.org/abs/2006.09963

This paper proposes GCC, an unsupervised pre-training framework for graph representation learning that captures topological properties shared across different graphs, without requiring additional attributes or labels as input. GCC formulates the pre-training task as discriminating subgraph-level instances within the same graph or across graphs, and uses contrastive learning so that the model learns intrinsic, transferable structural representations. A series of experiments validates the great potential of the pre-training & fine-tuning paradigm for graph representation learning.
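A rough sketch of the subgraph instance discrimination idea: two random-walk-with-restart subgraphs drawn around the same node form a positive pair, while subgraphs drawn around other nodes serve as negatives for a contrastive objective such as the InfoNCE loss above. The `rwr_subgraph` helper below is hypothetical and greatly simplified relative to GCC's actual sampler.

```python
import random
import networkx as nx

def rwr_subgraph(graph, seed, walk_length=16, restart_prob=0.5):
    """Sample a subgraph around `seed` with a simple random walk with restart (hypothetical helper)."""
    visited = {seed}
    current = seed
    for _ in range(walk_length):
        if random.random() < restart_prob:
            current = seed          # restart at the seed node
        else:
            neighbors = list(graph.neighbors(current))
            if not neighbors:
                current = seed
                continue
            current = random.choice(neighbors)
        visited.add(current)
    return graph.subgraph(visited)

# Two "views" of the same node form a positive pair; subgraphs of other nodes act as negatives.
G = nx.karate_club_graph()
query_view = rwr_subgraph(G, seed=0)
positive_view = rwr_subgraph(G, seed=0)
negative_views = [rwr_subgraph(G, seed=n)
                  for n in random.sample([n for n in G.nodes if n != 0], 4)]
```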

Understanding Negative Sampling in Graph Representation Learning

论文地址:https://arxiv.org/abs/2005.09863

This paper analyzes the role of negative sampling in graph representation learning from the perspectives of both the objective function and the risk, and proves theoretically that the negative sampling distribution should be positively but sub-linearly correlated with the positive sampling distribution. Building on this result, the paper proposes a new negative sampling strategy, MCNS, and accelerates the sampling procedure with a modified Metropolis-Hastings algorithm.
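The core prescription can be illustrated with a simple Metropolis-Hastings sampler whose stationary distribution is proportional to p_pos(v)^α with 0 < α < 1. This is a bare-bones sketch, not MCNS itself; the uniform proposal, the α value, and the toy p_pos below are assumptions.

```python
import random

def negative_sampler(nodes, p_pos, alpha=0.75, burn_in=100):
    """Yield negative samples whose stationary distribution is proportional to p_pos[v] ** alpha."""
    target = lambda v: p_pos[v] ** alpha      # sub-linear in the positive distribution
    current = random.choice(nodes)
    steps = 0
    while True:
        proposal = random.choice(nodes)       # symmetric (uniform) proposal
        accept = min(1.0, target(proposal) / target(current))
        if random.random() < accept:
            current = proposal
        steps += 1
        if steps > burn_in:
            yield current

# Usage with a hypothetical positive distribution over five nodes.
nodes = [0, 1, 2, 3, 4]
p_pos = {0: 0.4, 1: 0.3, 2: 0.15, 3: 0.1, 4: 0.05}
sampler = negative_sampler(nodes, p_pos)
negatives = [next(sampler) for _ in range(10)]
```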


Latest Paper

Self-supervised learning has attracted great interest due to its tremendous potential for learning discriminative representations in an unsupervised manner. Along this direction, contrastive learning achieves current state-of-the-art performance. Despite the acknowledged successes, existing contrastive learning methods suffer from very low learning efficiency, e.g., taking about ten times more training epochs than supervised learning to reach comparable recognition accuracy. In this paper, we identify two contradictory phenomena in contrastive learning that we call the under-clustering and over-clustering problems, which are major obstacles to learning efficiency. Under-clustering means that the model cannot efficiently learn to discover the dissimilarity between inter-class samples when the negative sample pairs for contrastive learning are insufficient to differentiate all the actual object categories. Over-clustering implies that the model cannot efficiently learn the feature representation from excessive negative sample pairs, which include many outliers and thus force the model to over-cluster samples of the same actual categories into different clusters. To simultaneously overcome these two problems, we propose a novel self-supervised learning framework using a median triplet loss. Specifically, we employ a triplet loss that tends to maximize the relative distance between the positive pair and negative pairs, addressing the under-clustering problem; and we construct the negative pair by selecting the negative sample with the median similarity score among all negative samples, avoiding the over-clustering problem with a guarantee derived from a Bernoulli distribution model. We extensively evaluate our proposed framework on several large-scale benchmarks (e.g., ImageNet, SYSU-30k, and COCO). The results demonstrate that our model outperforms the latest state-of-the-art methods by a clear margin.
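A hedged sketch of the median-negative idea described in the abstract, written as a PyTorch triplet loss; the function name, tensor shapes, and margin are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def median_triplet_loss(anchor, positive, negatives, margin=0.2):
    """Triplet loss where the negative is the candidate whose similarity to the
    anchor is the median over all negatives (sketch of the abstract's idea).
    anchor, positive: (B, D); negatives: (B, K, D)
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    neg_sims = torch.einsum("bd,bkd->bk", anchor, negatives)        # (B, K) anchor-negative similarities
    median_idx = neg_sims.argsort(dim=1)[:, neg_sims.size(1) // 2]  # index of the median-similarity negative
    batch_idx = torch.arange(anchor.size(0), device=anchor.device)
    median_neg = negatives[batch_idx, median_idx]                   # (B, D)

    pos_sim = (anchor * positive).sum(-1)
    neg_sim = (anchor * median_neg).sum(-1)
    # Push the positive similarity above the median-negative similarity by at least the margin.
    return F.relu(neg_sim - pos_sim + margin).mean()
```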
