Word2vec is one of the most widely used algorithms for generating word embeddings because of its good mix of efficiency, quality of the generated representations, and cognitive grounding. However, word meaning is not static and depends on the context in which words are used. Differences in word meaning that depend on time, location, topic, and other factors can be studied by analyzing embeddings generated from different corpora in collections that are representative of these factors. For example, language evolution can be studied using a collection of news articles published in different time periods. In this paper, we present a general framework to support cross-corpora language studies with word embeddings, where embeddings generated from different corpora can be compared to find correspondences and differences in meaning across the corpora. CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora. In particular, we focus on providing solid evidence about the effectiveness, generality, and robustness of CADE. To this end, we conduct quantitative and qualitative experiments in different domains, from temporal word embeddings to language localization and topical analysis. The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available, while remaining a general method that can be used in a variety of domains. Finally, our experiments shed light on the conditions under which the alignment is reliable; this reliability depends substantially on the degree of cross-corpora vocabulary overlap.
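To illustrate what comparing embeddings across corpora can look like once they share a vector space, the following is a minimal sketch and not the CADE method itself. It assumes the alignment has already been done by some means, and it scores how much a word's meaning differs between two corpora as one minus the cosine similarity of its two vectors. The helper names (`cosine`, `meaning_shift`), the corpus labels, and the two-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def meaning_shift(word, emb_a, emb_b):
    """Return 1 - cosine similarity of a word's vectors in two spaces.

    Assumes emb_a and emb_b are ALREADY aligned (same coordinate system).
    Returns None when the word is missing from either vocabulary, since
    words outside the shared vocabulary cannot be compared directly.
    """
    if word not in emb_a or word not in emb_b:
        return None
    return 1.0 - cosine(emb_a[word], emb_b[word])


# Made-up toy vectors standing in for embeddings from two corpora,
# e.g. news articles from two different time periods.
emb_corpus_a = {"apple": [1.0, 0.0], "bank": [0.0, 1.0]}
emb_corpus_b = {"apple": [0.0, 1.0], "bank": [0.0, 2.0]}

# Score every word of the first corpus against the second one.
shifts = {w: meaning_shift(w, emb_corpus_a, emb_corpus_b) for w in emb_corpus_a}

# Rank words by how strongly their vectors differ across the corpora.
for word, shift in sorted(shifts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {shift:.2f}")
```

In this toy data, "apple" points in orthogonal directions in the two spaces and so receives the largest shift, while "bank" keeps its direction and receives a shift of zero. Returning `None` for words missing from either vocabulary reflects the fact that only words shared by both corpora can be compared in this way.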