Inferring discriminative and coherent latent topics from short texts is a critical and fundamental task, since many real-world applications require semantic understanding of short texts. Traditional long text topic modeling algorithms (e.g., PLSA and LDA), which rely on word co-occurrences, cannot solve this problem well because short texts provide only very limited word co-occurrence information. Short text topic modeling, which aims to overcome this sparsity problem, has therefore attracted much attention from the machine learning research community in recent years. In this survey, we conduct a comprehensive review of the short text topic modeling techniques proposed in the literature. We present three categories of methods, based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and an analysis of their performance on various tasks. We develop the first comprehensive open-source Java library, called STTM, which integrates all surveyed algorithms and benchmark datasets within a unified interface to facilitate the development of new methods in this research field. Finally, we evaluate these state-of-the-art methods on many real-world datasets and compare their performance against one another and against long text topic modeling algorithms.
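The sparsity argument above can be made concrete with a small sketch. The Java snippet below is an illustration only and is not part of the STTM library or its API; the class name `BitermSketch`, its methods, and the toy documents are all hypothetical. It counts unordered word pairs within each short document and pools them over the whole corpus, which is one simple way to picture what "global word co-occurrences" means: a single four-word text yields at most six pairs, while pooling across texts lets repeated pairs accumulate evidence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the STTM API.
final class BitermSketch {

    // Lowercases and splits on whitespace. A real pipeline would also
    // remove stop words and punctuation (assumption: input is pre-cleaned).
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Adds every unordered pair of distinct words in one document to the
    // corpus-wide counts. Keys are "a|b" with a sorted before b.
    static void addPairs(List<String> tokens, Map<String, Integer> counts) {
        for (int i = 0; i < tokens.size(); i++) {
            for (int j = i + 1; j < tokens.size(); j++) {
                String a = tokens.get(i);
                String b = tokens.get(j);
                if (a.equals(b)) {
                    continue; // skip pairs of the same word
                }
                String key = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
                counts.merge(key, 1, Integer::sum);
            }
        }
    }

    // Pools within-document word pairs over all documents in the corpus.
    static Map<String, Integer> corpusPairs(List<String> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            addPairs(tokenize(doc), counts);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy short texts, invented for this example.
        List<String> docs = Arrays.asList(
                "apple releases new phone",
                "new phone battery review",
                "apple stock rises");
        Map<String, Integer> counts = corpusPairs(docs);
        System.out.println("distinct pairs in corpus: " + counts.size());
        System.out.println("count of new|phone: " + counts.get("new|phone"));
    }
}
```

In this toy corpus no single document can say much about which words belong together, but the pair `new|phone` recurs across documents, and it is this kind of corpus-level signal that methods in the global word co-occurrence category draw on.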