Word embeddings are a key component of many downstream applications in natural language processing. Existing approaches often assume the availability of a large text corpus for learning effective word embeddings. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model from a corpus with only a few million tokens. In this setting, the co-occurrence matrix is sparse, as the co-occurrences of many word pairs are unobserved. Whereas existing approaches typically sample only a few unobserved word pairs as negative examples, we argue that the zero entries in the co-occurrence matrix also provide valuable information. We therefore design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix, and we validate the proposed approach on four different languages.
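To make the core idea concrete, the following is a minimal sketch of PU-weighted matrix factorization: observed co-occurrence entries receive full weight in the loss, while zero (unobserved) entries are kept as weakly-labeled negatives with a small weight `rho` instead of being discarded or sparsely sampled. The function `pu_factorize`, the weighting scheme, and all hyperparameter values are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pu_factorize(C, k=2, rho=0.05, lr=0.01, n_iter=500, seed=0):
    """Factorize a sparse co-occurrence matrix C ~= W @ H.T.

    Illustrative PU-style objective (not the paper's exact loss):
    non-zero entries get weight 1.0; zero entries are treated as
    unlabeled negatives with small weight `rho`, so every cell of
    the matrix contributes to the loss.
    """
    rng = np.random.default_rng(seed)
    n, m = C.shape
    W = 0.1 * rng.standard_normal((n, k))
    H = 0.1 * rng.standard_normal((m, k))
    weight = np.where(C > 0, 1.0, rho)   # PU weighting of each cell
    for _ in range(n_iter):
        R = weight * (W @ H.T - C)       # weighted residual
        W -= lr * (R @ H) / m            # gradient step on W
        H -= lr * (R.T @ W) / n          # gradient step on H
    return W, H

# Toy usage: a small sparse "co-occurrence" matrix.
C = np.array([[3.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [2.0, 0.0, 3.0],
              [1.0, 2.0, 0.0]])
W, H = pu_factorize(C)
```

The key contrast with negative sampling is visible in the `weight` matrix: no zero entry is dropped; each contributes a small, fixed penalty that pushes unobserved pairs toward low scores.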