Word embeddings are a key component of many downstream applications in natural language processing. Existing approaches often assume the availability of a large text corpus for learning effective word embeddings. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model from a corpus with only a few million tokens. In this setting, the co-occurrence matrix is sparse, as the co-occurrences of many word pairs are unobserved. Whereas existing approaches typically sample only a few unobserved word pairs as negative examples, we argue that the zero entries in the co-occurrence matrix also provide valuable information. We therefore design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix, and we validate the proposed approach on four different languages.
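To make the core idea concrete, the following is a minimal sketch of PU-weighted matrix factorization: observed co-occurrence entries receive full weight in the loss, while zero (unobserved) entries are kept as weakly-labeled negatives with a small weight `rho` instead of being discarded or sparsely sampled. The function `pu_factorize`, the weighting scheme, and all hyperparameter values are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pu_factorize(C, k=2, rho=0.05, lr=0.01, n_iter=500, seed=0):
    """Factorize a sparse co-occurrence matrix C ~= W @ H.T.

    Illustrative PU-style objective (not the paper's exact loss):
    non-zero entries get weight 1.0; zero entries are treated as
    unlabeled negatives with small weight `rho`, so every cell of
    the matrix contributes to the loss.
    """
    rng = np.random.default_rng(seed)
    n, m = C.shape
    W = 0.1 * rng.standard_normal((n, k))
    H = 0.1 * rng.standard_normal((m, k))
    weight = np.where(C > 0, 1.0, rho)   # PU weighting of each cell
    for _ in range(n_iter):
        R = weight * (W @ H.T - C)       # weighted residual
        W -= lr * (R @ H) / m            # gradient step on W
        H -= lr * (R.T @ W) / n          # gradient step on H
    return W, H

# Toy usage: a small sparse "co-occurrence" matrix.
C = np.array([[3.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [2.0, 0.0, 3.0],
              [1.0, 2.0, 0.0]])
W, H = pu_factorize(C)
```

The key contrast with negative sampling is visible in the `weight` matrix: no zero entry is dropped; each contributes a small, fixed penalty that pushes unobserved pairs toward low scores.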