Topic modeling analyzes documents to learn meaningful patterns of words. Dynamic topic models capture how these patterns vary over time for a set of documents that were collected over a large time span. We develop the dynamic embedded topic model (D-ETM), a generative model of documents that combines dynamic latent Dirichlet allocation (D-LDA) and word embeddings. The D-ETM models each word with a categorical distribution whose parameter is given by the inner product between the word embedding and an embedding representation of its assigned topic at a particular time step. The word embeddings allow the D-ETM to generalize to rare words. The D-ETM learns smooth topic trajectories by defining a random walk prior over the embeddings of the topics. We fit the D-ETM using structured amortized variational inference. On a collection of United Nations debates, we find that the D-ETM learns interpretable topics and outperforms D-LDA in terms of both topic quality and predictive performance.
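The two core mechanisms in the abstract — the per-time-step topic embeddings evolving under a Gaussian random-walk prior, and the categorical word distribution given by a softmax over inner products of word embeddings with a topic embedding — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the dimensions, the step size `sigma`, and the random initialization are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary, topics, embedding dim, time steps.
V, K, L, T = 1000, 10, 50, 5

# Word embeddings rho: one L-dimensional vector per vocabulary word.
rho = rng.normal(size=(V, L))

# Topic embeddings alpha[t, k] evolve under a Gaussian random walk over
# time, which is what yields the smooth topic trajectories described above.
sigma = 0.1  # random-walk step size (hypothetical value)
alpha = np.zeros((T, K, L))
alpha[0] = rng.normal(size=(K, L))
for t in range(1, T):
    alpha[t] = alpha[t - 1] + sigma * rng.normal(size=(K, L))

def topic_word_dist(t, k):
    """Categorical distribution over the vocabulary for topic k at time t:
    the softmax of the inner product between each word embedding and the
    topic's embedding at that time step."""
    logits = rho @ alpha[t, k]   # shape (V,)
    logits -= logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

beta = topic_word_dist(0, 0)     # distribution over V words
```

Because rare words still have embeddings near related frequent words, their probabilities under each topic are informed by those neighbors, which is the sense in which the embedding parameterization generalizes to rare words.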