As the first step in automated natural language processing, representing words and sentences is of central importance and has attracted significant research attention. Different approaches, from early one-hot and bag-of-words representations to more recent distributional dense and sparse representations, have been proposed. Despite the successful results that have been achieved, such vectors tend to consist of uninterpretable components and face nontrivial challenges in both memory and computational requirements in practical applications. In this paper, we design a novel representation model that projects dense word vectors into a higher-dimensional space and favors a highly sparse, binary representation with potentially interpretable components, while preserving the pairwise inner products between the original vectors as much as possible. Computationally, our model is relaxed into a symmetric non-negative matrix factorization problem, which admits a fast yet effective solution. In a series of empirical evaluations, the proposed model exhibited consistent improvements and high potential for practical applications.
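To make the relaxation concrete, the sketch below illustrates the general idea of a symmetric non-negative matrix factorization of a similarity (Gram) matrix, followed by a crude binarization. This is a minimal illustration under stated assumptions, not the paper's actual algorithm: the multiplicative-update rule (a standard SymNMF scheme), the damping constant `beta`, the clipping of negative similarities, and the mean-based binarization threshold are all assumptions made for the example.

```python
import numpy as np

def symmetric_nmf(G, k, n_iter=200, beta=0.5, seed=0):
    """Approximate a nonnegative similarity matrix G (n x n) as H @ H.T, H >= 0.

    Uses a standard damped multiplicative update; `beta` is an illustrative
    damping constant, not a value from the paper.
    """
    rng = np.random.default_rng(seed)
    n = G.shape[0]
    H = rng.random((n, k))  # positive initialization keeps updates nonnegative
    for _ in range(n_iter):
        GH = G @ H
        HHtH = H @ (H.T @ H)
        # damped multiplicative update: H stays elementwise nonnegative
        H = H * (1.0 - beta + beta * GH / np.maximum(HHtH, 1e-10))
    return H

# toy dense embeddings: 6 "words" in 3 dimensions
X = np.random.default_rng(1).normal(size=(6, 3))
G = np.maximum(X @ X.T, 0.0)  # clip negatives so a nonnegative factorization applies
H = symmetric_nmf(G, k=8)     # project into a higher-dimensional nonnegative space
B = (H > H.mean()).astype(int)  # crude binarization into sparse binary codes
```

The key design point the abstract alludes to is that factorizing the Gram matrix `G = X @ X.T` (rather than `X` itself) preserves pairwise inner products between words, so the learned codes inherit the similarity structure of the dense embeddings.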