Word-Graph2vec: An efficient word embedding approach on word co-occurrence graph using random walk sampling

Word embedding has become ubiquitous and is widely used in various text mining and natural language processing (NLP) tasks, such as information retrieval, semantic analysis, and machine translation. Unfortunately, training word embeddings on a relatively large corpus is prohibitively expensive. We propose a graph-based word embedding algorithm, called Word-Graph2vec, which converts the large corpus into a word co-occurrence graph, samples word sequences from this graph by random walks, and finally trains the word embedding on the sampled corpus. We posit that, because of the stable vocabulary, relative idioms, and fixed expressions in English, the size and density of the word co-occurrence graph change only slightly as the training corpus grows. As a result, Word-Graph2vec has a stable runtime on large-scale datasets, and its performance advantage becomes increasingly pronounced as the training corpus grows. Extensive experiments conducted on real-world datasets show that the proposed algorithm is four to five times more efficient than traditional Skip-Gram, while the error introduced by the random walk sampling is small.
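The three stages named in the abstract (build a word co-occurrence graph, sample word sequences by random walks, train on the samples) can be illustrated with a minimal sketch. The abstract does not specify edge weighting, start-node selection, or walk length, so the choices below (undirected edges weighted by co-occurrence count, a fixed number of walks per node, and the hypothetical names `build_cooccurrence_graph`, `random_walk`, and `sample_corpus`) are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(sentences, window=2):
    """Return {word: {neighbor: count}} for words co-occurring within `window` tokens."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only forward; the edge is recorded in both directions below.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                neighbor = tokens[j]
                if neighbor == word:
                    continue  # assumption: no self-loops in this sketch
                graph[word][neighbor] += 1
                graph[neighbor][word] += 1
    return {w: dict(nbrs) for w, nbrs in graph.items()}


def random_walk(graph, start, length, rng):
    """Walk up to `length` nodes, choosing each next word in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # dead end: stop the walk early
        words = list(neighbors)
        weights = [neighbors[w] for w in words]
        walk.append(rng.choices(words, weights=weights, k=1)[0])
    return walk


def sample_corpus(graph, walks_per_node=2, walk_length=5, seed=0):
    """Assumed sampling scheme: start `walks_per_node` walks from every node."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        for node in sorted(graph):
            corpus.append(random_walk(graph, node, walk_length, rng))
    return corpus


sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
graph = build_cooccurrence_graph(sentences, window=2)
walks = sample_corpus(graph)
```

Each entry of `walks` is a word sequence that can stand in for a sentence of the original corpus. The sampled sequences could then be passed to any Skip-Gram trainer, for example gensim's `Word2Vec` with `sg=1`. Because the graph stops growing much once the vocabulary and common word pairs are covered, the number of sampled sequences is tied to the graph size and not to the raw corpus size, which is the efficiency argument the abstract makes.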

Related content

Word vector representation

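A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers noticed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. These methods can all be trained directly on large-scale unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size. In general, embeddings learned with a large context window reflect topical information, while embeddings learned with a small context window reflect the function of a word and its contextual semantics.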
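To make the role of the context window concrete, the following minimal sketch lists the (center, context) training pairs that a window-based model would see for two window sizes. The helper name `context_pairs` and the example sentence are hypothetical and serve only as illustration.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions of the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["the", "king", "rules", "the", "old", "kingdom"]
narrow = context_pairs(tokens, window=1)  # immediate neighbors only
wide = context_pairs(tokens, window=3)    # also pairs words up to three positions apart
```

With `window=1` each word is paired only with its direct neighbors, which emphasizes syntactic and functional similarity. With `window=3` a word such as "king" is also paired with "old", a more distant word from the same sentence, which pushes the learned vectors toward topical similarity.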
