Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the Embedded Topic Model (ETM), a generative model of documents that marries traditional topic models with word embeddings. In particular, it models each word with a categorical distribution whose natural parameter is the inner product between a word embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation (LDA), in terms of both topic quality and predictive performance.
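The core modeling idea above can be sketched in a few lines: each topic's distribution over the vocabulary is a softmax of inner products between word embeddings and that topic's embedding. This is a minimal illustration, not the paper's implementation; the embeddings here are random placeholders, and the dimensions (`V`, `K`, `L`) are arbitrary choices for demonstration.

```python
import numpy as np

# Illustrative dimensions: vocabulary size, number of topics, embedding dim
V, K, L = 5000, 50, 300

rng = np.random.default_rng(0)
rho = rng.normal(size=(V, L))    # word embeddings (rho in the ETM)
alpha = rng.normal(size=(K, L))  # topic embeddings (alpha in the ETM)

def softmax(x):
    # Numerically stable softmax over a vector of logits
    e = np.exp(x - x.max())
    return e / e.sum()

# Per-topic word distribution: the natural parameter for topic k is
# the inner product rho @ alpha_k, normalized with a softmax
beta = np.stack([softmax(rho @ alpha[k]) for k in range(K)])

# Each row of beta is a valid categorical distribution over the vocabulary
assert np.allclose(beta.sum(axis=1), 1.0)
```

In the full model these embeddings are learned jointly with per-document topic proportions via the amortized variational inference procedure the abstract mentions; this sketch only shows how the inner-product parameterization yields topic-word distributions.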