We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context):

(1) No Language Structure in Context: Probabilistic topic models ignore word order by summarizing a given context as a "bag-of-words", and consequently the semantics of words in the context are lost. An LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while a TM simultaneously learns a latent representation of the entire document and discovers its underlying thematic structure. We unite these two complementary paradigms of learning the meaning of word occurrences by combining a TM (e.g., DocNADE) and a LM in a unified probabilistic framework, named ctx-DocNADE.

(2) Limited Context and/or Small Training Corpus of Documents: In settings with few word occurrences (i.e., lack of context) in short texts, or with data sparsity in a corpus of few documents, applying TMs is challenging. We address this challenge by incorporating external knowledge into neural autoregressive topic models via a language-modelling approach: we use pre-trained word embeddings as input to the LSTM-LM, with the aim of improving the word-topic mapping on a small and/or short-text corpus. This DocNADE extension is named ctx-DocNADEe.

In sum, we present novel neural autoregressive topic model variants coupled with neural LMs and embedding priors that consistently outperform state-of-the-art generative TMs in terms of generalization (perplexity), interpretability (topic coherence) and applicability (retrieval and classification) on 6 long-text and 8 short-text datasets from diverse domains.
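To make the composition of the two views concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a composite hidden state of the form h_i = g(c + Σ_{k<i} W[:, v_k] + λ·h_i^LM), i.e., the DocNADE bag-of-words prefix sum mixed with an LSTM-LM state. The class name `CtxDocNADESketch`, the projection layer `proj`, and the mixing weight `lam` are illustrative assumptions; loading pre-trained vectors into `E` corresponds to the embedding prior of ctx-DocNADEe.

```python
import torch
import torch.nn as nn

class CtxDocNADESketch(nn.Module):
    """Sketch of a composite hidden state: h_i = g(c + sum_{k<i} W[:, v_k] + lam * proj(h_i^LM))."""
    def __init__(self, vocab_size, hidden_size, embed_dim, lam=0.5):
        super().__init__()
        # DocNADE parameters: word-topic matrix W and hidden bias c.
        self.W = nn.Parameter(0.01 * torch.randn(hidden_size, vocab_size))
        self.c = nn.Parameter(torch.zeros(hidden_size))
        # Embedding table; for ctx-DocNADEe, initialize with pre-trained vectors (external knowledge).
        self.E = nn.Embedding(vocab_size, embed_dim)
        # LSTM-LM capturing word order and local collocation patterns.
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.lam = lam  # mixing weight between TM and LM views (illustrative)

    def hidden_states(self, doc):
        # doc: LongTensor of word indices, shape (seq_len,), in document order.
        lm_h, _ = self.lstm(self.E(doc).unsqueeze(0))    # (1, seq_len, hidden)
        lm_h = lm_h.squeeze(0)
        # DocNADE view: prefix sums of W columns over preceding words.
        bow = torch.cumsum(self.W[:, doc].t(), dim=0)    # (seq_len, hidden)
        # Shift both views by one position so h_i conditions only on v_<i,
        # preserving the autoregressive factorization.
        zero = torch.zeros(1, bow.size(1))
        bow = torch.cat([zero, bow[:-1]], dim=0)
        lm_h = torch.cat([zero, lm_h[:-1]], dim=0)
        return torch.sigmoid(self.c + bow + self.lam * self.proj(lm_h))
```

Each h_i would then feed a softmax over the vocabulary to give P(v_i | v_<i), and training maximizes the document log-likelihood, as in DocNADE.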