Continuous Bag of Words (CBOW) is a powerful text embedding method. Because it encodes word content well, CBOW embeddings perform well on a wide range of downstream tasks while remaining efficient to compute. However, CBOW cannot capture word order. The reason is that the computation of CBOW's embeddings is commutative, i.e., the embeddings of XYZ and ZYX are the same. To address this shortcoming, we propose a learning algorithm for the Continuous Matrix Space Model, which we call Continual Multiplication of Words (CMOW). Our algorithm is an adaptation of word2vec, so that it can be trained on large quantities of unlabeled text. We empirically show that CMOW captures linguistic properties better than CBOW, but is inferior to CBOW in memorizing word content. Motivated by these findings, we propose a hybrid model that combines the strengths of CBOW and CMOW. Our results show that the hybrid CBOW-CMOW model retains CBOW's strong ability to memorize word content while substantially improving its ability to encode other linguistic information, by 8%. As a result, the hybrid also performs better on 8 out of 11 supervised downstream tasks, with an average improvement of 1.2%.
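The commutativity argument can be illustrated with a small sketch. It is not the training algorithm from this work: the embeddings below are random rather than learned. The dimension `d`, the near-identity initialization of the matrices, and the use of concatenation for the hybrid are all assumptions made for illustration. The function names `cbow_encode`, `cmow_encode`, and `hybrid_encode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # assumed matrix side length, chosen only for illustration

vocab = ["X", "Y", "Z"]
# Random stand-ins for learned embeddings (assumption: nothing is trained here).
vec_emb = {w: rng.normal(size=d * d) for w in vocab}
# Assumed near-identity initialization so that products stay well scaled.
mat_emb = {w: np.eye(d) + 0.1 * rng.normal(size=(d, d)) for w in vocab}


def cbow_encode(words):
    """Sum of word vectors: addition is commutative, so order is lost."""
    return np.sum([vec_emb[w] for w in words], axis=0)


def cmow_encode(words):
    """Ordered product of word matrices: matrix multiplication does not commute."""
    out = np.eye(d)
    for w in words:
        out = out @ mat_emb[w]
    return out.reshape(-1)


def hybrid_encode(words):
    """Assumed combination: concatenate the two encodings."""
    return np.concatenate([cbow_encode(words), cmow_encode(words)])


xyz, zyx = ["X", "Y", "Z"], ["Z", "Y", "X"]
cbow_same = np.allclose(cbow_encode(xyz), cbow_encode(zyx))
cmow_same = np.allclose(cmow_encode(xyz), cmow_encode(zyx))
print("CBOW identical for XYZ and ZYX:", cbow_same)
print("CMOW identical for XYZ and ZYX:", cmow_same)
```

In this sketch the summed encoding is identical for both word orders, while the matrix product differs, which is the property that motivates combining the two.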