We address two challenges in topic models: (1) Context around a word helps determine its actual meaning, e.g., "networks" as used in "artificial neural networks" vs. "biological neuron networks". Generative topic models infer topic-word distributions while taking little or no such context into account. Here, we extend a neural autoregressive topic model to exploit the full context around each word in a document, in a language-modeling fashion; the proposed model is named iDocNADE. (2) Short texts contain few word occurrences (i.e., lack of context), and corpora with few documents suffer from data sparsity, which makes applying topic models to such texts challenging. We therefore propose a simple and efficient way of incorporating external knowledge into neural autoregressive topic models: we use pre-trained word embeddings as a distributional prior. The proposed variants are named DocNADEe and iDocNADEe. These novel neural autoregressive topic model variants consistently outperform state-of-the-art generative topic models in terms of generalization, interpretability (topic coherence), and applicability (retrieval and classification) on 7 long-text and 8 short-text datasets from diverse domains.
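To make the two ideas concrete, the following is a minimal, self-contained sketch of a DocNADE-style autoregressive likelihood with an embedding prior. All sizes, the mixing weight `lam`, the ReLU nonlinearity, and the random matrices are illustrative assumptions, not the paper's trained parameters; `E` stands in for a pre-trained embedding matrix (the DocNADEe prior), and iDocNADE would add a symmetric backward pass over the words following each position.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 50, 8                      # toy vocabulary and hidden sizes (assumed)
W = rng.normal(0, 0.1, (H, V))    # topic-word weights (learned in the paper)
E = rng.normal(0, 0.1, (H, V))    # pre-trained word embeddings (fixed prior; random here)
U = rng.normal(0, 0.1, (V, H))    # output weights
b = np.zeros(V)                   # output bias
c = np.zeros(H)                   # hidden bias
lam = 1.0                         # prior mixing weight (hyperparameter, assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def log_likelihood(doc):
    """Autoregressive log p(v) = sum_i log p(v_i | v_<i), forward pass only.

    Each hidden state aggregates the preceding words through W plus the
    embedding prior E; iDocNADE adds a backward pass over v_>i so the
    conditional for each word sees its full surrounding context.
    """
    acc = c.copy()                # running sum over the preceding words
    ll = 0.0
    for v_i in doc:
        h = np.maximum(acc, 0)    # nonlinearity over accumulated context
        p = softmax(b + U @ h)    # p(v_i | v_<i) over the vocabulary
        ll += np.log(p[v_i])
        acc += W[:, v_i] + lam * E[:, v_i]  # fold the current word into the context
    return ll

doc = [3, 17, 42, 7]              # a document as word indices
print(log_likelihood(doc))
```

With untrained random weights the log-likelihood is close to uniform (about `len(doc) * log(1/V)`); training would adjust `W`, `U`, `b`, `c` to maximize it while `E` stays fixed as the prior.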