Most of the information on the Internet is represented in the form of microtexts: short text snippets such as news headlines or tweets. These sources of information are abundant, and mining this data could uncover meaningful insights. Topic modeling is one of the most popular methods for extracting knowledge from a collection of documents; nevertheless, conventional topic models such as Latent Dirichlet Allocation (LDA) are unable to perform well on short documents, mostly due to the scarcity of word co-occurrence statistics in the data. The objective of our research is to create a topic model that achieves strong performance on microtexts while keeping runtime low enough to scale to large datasets. To compensate for the lack of information in microtexts, our method takes advantage of word embeddings for additional knowledge of the relationships between words. For speed and scalability, we apply Auto-Encoding Variational Bayes (AEVB), an algorithm that performs efficient black-box inference in probabilistic models. The result of our work is a novel topic model called the Nested Variational Autoencoder: a distribution that takes word vectors into account and is parameterized by a neural network architecture. For optimization, the model is trained to approximate the posterior distribution of the original LDA model. Experiments show the improvements of our model on microtexts as well as its runtime advantage.
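To make the AEVB mechanism referenced above concrete, the following is a minimal sketch (not the paper's actual architecture) of a single forward pass of a neural topic model: an encoder maps a bag-of-words vector to variational parameters, a sample is drawn via the reparameterization trick, and a softmax yields topic proportions that mix topic-word distributions. All weights, layer sizes, and names here are illustrative assumptions; in practice they would be learned by gradient descent on the evidence lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: vocabulary, topics, encoder hidden units (assumptions).
V, K, H = 50, 5, 16

# Hypothetical weights; a real model learns these by backpropagation.
W_h = rng.normal(0, 0.1, (H, V))
W_mu = rng.normal(0, 0.1, (K, H))
W_lv = rng.normal(0, 0.1, (K, H))
beta_logits = rng.normal(0, 0.1, (K, V))  # topic-word parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(bow):
    """One AEVB-style forward pass for a single bag-of-words vector."""
    h = np.tanh(W_h @ bow)                 # encoder hidden layer
    mu, logvar = W_mu @ h, W_lv @ h        # variational mean and log-variance
    eps = rng.normal(size=K)
    z = mu + np.exp(0.5 * logvar) * eps    # reparameterization trick
    theta = softmax(z)                     # topic proportions (logistic-normal)
    beta = np.apply_along_axis(softmax, 1, beta_logits)  # topic-word dists
    p_w = theta @ beta                     # per-document word distribution
    return theta, p_w

bow = rng.poisson(1.0, V).astype(float)    # toy word-count vector
theta, p_w = forward(bow)
```

Because sampling is pushed into the noise term `eps`, gradients can flow through `mu` and `logvar`, which is what lets AEVB train the encoder and topic parameters jointly and quickly at inference time.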