主题模型,顾名思义,就是对文字中隐含主题的一种建模方法。“苹果”这个词的背后既包含是苹果公司这样一个主题,也包括了水果的主题。   在这里,我们先定义一下主题究竟是什么。主题就是一个概念、一个方面。它表现为一系列相关的词语。比如一个文章如果涉及到“百度”这个主题,那么“中文搜索”、“李彦宏”等词语就会以较高的频率出现,而如果涉及到“IBM”这个主题,那么“笔记本”等就会出现的很频繁。如果用数学来描述一下的话,主题就是词汇表上词语的条件概率分布 。与主题关系越密切的词语,它的条件概率越大,反之则越小。

    Steady progress has been made in abstractive summarization with attention-based sequence-to-sequence learning models. In this paper, we propose a new decoder where the output summary is generated by conditioning on both the input text and the latent topics of the document. The latent topics, identified by a topic model such as LDA, reveals more global semantic information that can be used to bias the decoder to generate words. In particular, they enable the decoder to have access to additional word co-occurrence statistics captured at document corpus level. We empirically validate the advantage of the proposed approach on both the CNN/Daily Mail and the WikiHow datasets. Concretely, we attain strongly improved ROUGE scores when compared to state-of-the-art models.

    点赞 0
    阅读0+

    The task of discovering topics in text corpora has been dominated by Latent Dirichlet Allocation and other Topic Models for over a decade. In order to apply these approaches to massive text corpora, the vocabulary needs to be reduced considerably and large computer clusters and/or GPUs are typically required. Moreover, the number of topics must be provided beforehand but this depends on the corpus characteristics and it is often difficult to estimate, especially for massive text corpora. Unfortunately, both topic quality and time complexity are sensitive to this choice. This paper describes an alternative approach to discover topics based on Min-Hashing, which can handle massive text corpora and large vocabularies using modest computer hardware and does not require to fix the number of topics in advance. The basic idea is to generate multiple random partitions of the corpus vocabulary to find sets of highly co-occurring words, which are then clustered to produce the final topics. In contrast to probabilistic topic models where topics are distributions over the complete vocabulary, the topics discovered by the proposed approach are sets of highly co-occurring words. Interestingly, these topics underlie various thematics with different levels of granularity. An extensive qualitative and quantitative evaluation using the 20 Newsgroups (18K), Reuters (800K), Spanish Wikipedia (1M), and English Wikipedia (5M) corpora shows that the proposed approach is able to consistently discover meaningful and coherent topics. Remarkably, the time complexity of the proposed approach is linear with respect to corpus and vocabulary size; a non-parallel implementation was able to discover topics from the entire English edition of Wikipedia with over 5 million documents and 1 million words in less than 7 hours.

    点赞 0
    阅读1+

    Latent Dirichlet Allocation (LDA) model is a famous model in the topic model field, it has been studied for years due to its extensive application value in industry and academia. However, the mathematical derivation of LDA model is challenging and difficult, which makes it difficult for the beginners to learn. To help the beginners in learning LDA, this book analyzes the mathematical derivation of LDA in detail, and it also introduces all the knowledge background to make it easy for beginners to understand. Thus, this book contains the author's unique insights. It should be noted that this book is written in Chinese.

    点赞 0
    阅读0+

    We present a framework that allows users to incorporate the semantics of their domain knowledge for topic model refinement while remaining model-agnostic. Our approach enables users to (1) understand the semantic space of the model, (2) identify regions of potential conflicts and problems, and (3) readjust the semantic relation of concepts based on their understanding, directly influencing the topic modeling. These tasks are supported by an interactive visual analytics workspace that uses word-embedding projections to define concept regions which can then be refined. The user-refined concepts are independent of a particular document collection and can be transferred to related corpora. All user interactions within the concept space directly affect the semantic relations of the underlying vector space model, which, in turn, change the topic modeling. In addition to direct manipulation, our system guides the users' decision-making process through recommended interactions that point out potential improvements. This targeted refinement aims at minimizing the feedback required for an efficient human-in-the-loop process. We confirm the improvements achieved through our approach in two user studies that show topic model quality improvements through our visual knowledge externalization and learning process.

    点赞 0
    阅读0+

    Image paragraph generation is the task of producing a coherent story (usually a paragraph) that describes the visual content of an image. The problem nevertheless is not trivial especially when there are multiple descriptive and diverse gists to be considered for paragraph generation, which often happens in real images. A valid question is how to encapsulate such gists/topics that are worthy of mention from an image, and then describe the image from one topic to another but holistically with a coherent structure. In this paper, we present a new design --- Convolutional Auto-Encoding (CAE) that purely employs convolutional and deconvolutional auto-encoding framework for topic modeling on the region-level features of an image. Furthermore, we propose an architecture, namely CAE plus Long Short-Term Memory (dubbed as CAE-LSTM), that novelly integrates the learnt topics in support of paragraph generation. Technically, CAE-LSTM capitalizes on a two-level LSTM-based paragraph generation framework with attention mechanism. The paragraph-level LSTM captures the inter-sentence dependency in a paragraph, while sentence-level LSTM is to generate one sentence which is conditioned on each learnt topic. Extensive experiments are conducted on Stanford image paragraph dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, CAE-LSTM increases CIDEr performance from 20.93% to 25.15%.

    点赞 0
    阅读0+

    Food choices are personal and complex and have a significant impact on our long-term health and quality of life. By helping users to make informed and satisfying decisions, Recommender Systems (RS) have the potential to support users in making healthier food choices. Intelligent users-modeling is a key challenge in achieving this potential. This paper investigates Ensemble Topic Modelling (EnsTM) based Feature Identification techniques for efficient user-modeling and recipe recommendation. It builds on findings in EnsTM to propose a reduced data representation format and a smart user-modeling strategy that makes capturing user-preference fast, efficient and interactive. This approach enables personalization, even in a cold-start scenario. This paper proposes two different EnsTM based and one Hybrid EnsTM based recommenders. We compared all three EnsTM based variations through a user study with 48 participants, using a large-scale,real-world corpus of 230,876 recipes, and compare against a conventional Content Based (CB) approach. EnsTM based recommenders performed significantly better than the CB approach. Besides acknowledging multi-domain contents such as taste, demographics and cost, our proposed approach also considers user's nutritional preference and assists them finding recipes under diverse nutritional categories. Furthermore, it provides excellent coverage and enables implicit understanding of user's food practices. Subsequent analysis also exposed correlation between certain features and a healthier lifestyle.

    点赞 0
    阅读0+

    Social media platforms such as Twitter are filled with social spambots. Detecting these malicious accounts is essential, yet challenging, as they continually evolve and evade traditional detection techniques. In this work, we propose VASSL, a visual analytics system that assists in the process of detecting and labeling spambots. Our tool enhances the performance and scalability of manual labeling by providing multiple connected views and utilizing dimensionality reduction, sentiment analysis and topic modeling techniques, which offer new insights that enable the identification of spambots. The system allows users to select and analyze groups of accounts in an interactive manner, which enables the detection of spambots that may not be identified when examined individually. We conducted a user study to objectively evaluate the performance of VASSL users, as well as capturing subjective opinions about the usefulness and the ease of use of the tool.

    点赞 0
    阅读0+

    Topic modeling is commonly used to analyze and understand large document collections. However, in practice, users want to focus on specific aspects or "targets" rather than the entire corpus. For example, given a large collection of documents, users may want only a smaller subset which more closely aligns with their interests, tasks, and domains. In particular, our paper focuses on large-scale document retrieval with high recall where any missed relevant documents can be critical. A simple keyword matching search is generally not effective nor efficient as 1) it is difficult to find a list of keyword queries that can cover the documents of interest before exploring the dataset, 2) some documents may not contain the exact keywords of interest but may still be highly relevant, and 3) some words have multiple meanings, which would result in irrelevant documents included in the retrieved subset. In this paper, we present TopicSifter, a visual analytics system for interactive search space reduction. Our system utilizes targeted topic modeling based on nonnegative matrix factorization and allows users to give relevance feedback in order to refine their target and guide the topic modeling to the most relevant results.

    点赞 0
    阅读0+

    We propose a novel neural topic model in the Wasserstein autoencoders (WAE) framework. Unlike existing variational autoencoder based models, we directly enforce Dirichlet prior on the latent document-topic vectors. We exploit the structure of the latent space and apply a suitable kernel in minimizing the Maximum Mean Discrepancy (MMD) to perform distribution matching. We discover that MMD performs much better than the Generative Adversarial Network (GAN) in matching high dimensional Dirichlet distribution. We further discover that incorporating randomness in the encoder output during training leads to significantly more coherent topics. To measure the diversity of the produced topics, we propose a simple topic uniqueness metric. Together with the widely used coherence measure NPMI, we offer a more wholistic evaluation of topic quality. Experiments on several real datasets show that our model produces significantly better topics than existing topic models.

    点赞 0
    阅读0+

    Traces of user interactions with a software system, captured in production, are commonly used as an input source for user experience testing. In this paper, we present an alternative use, introducing a novel approach of modeling user interaction traces enriched with another type of data gathered in production - software fault reports consisting of software exceptions and stack traces. The model described in this paper aims to improve developers' comprehension of the circumstances surrounding a specific software exception and can highlight specific user behaviors that lead to a high frequency of software faults. Modeling the combination of interaction traces and software crash reports to form an interpretable and useful model is challenging due to the complexity and variance in the combined data source. Therefore, we propose a probabilistic unsupervised learning approach, adapting the Nested Hierarchical Dirichlet Process, which is a Bayesian non-parametric topic model commonly applied to natural language data. This model infers a tree of topics, each of whom describes a set of commonly co-occurring commands and exceptions. The topic tree can be interpreted hierarchically to aid in categorizing the numerous types of exceptions and interactions. We apply the proposed approach to large scale datasets collected from the ABB RobotStudio software application, and evaluate it both numerically and with a small survey of the RobotStudio developers.

    点赞 0
    阅读0+

    Topic modeling analyzes documents to learn meaningful patterns of words. Dynamic topic models capture how these patterns vary over time for a set of documents that were collected over a large time span. We develop the dynamic embedded topic model (D-ETM), a generative model of documents that combines dynamic latent Dirichlet allocation (D-LDA) and word embeddings. The D-ETM models each word with a categorical distribution whose parameter is given by the inner product between the word embedding and an embedding representation of its assigned topic at a particular time step. The word embeddings allow the D-ETM to generalize to rare words. The D-ETM learns smooth topic trajectories by defining a random walk prior over the embeddings of the topics. We fit the D-ETM using structured amortized variational inference. On a collection of United Nations debates, we find that the D-ETM learns interpretable topics and outperforms D-LDA in terms of both topic quality and predictive performance.

    点赞 0
    阅读1+

    Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the Embedded Topic Model (ETM), a generative model of documents that marries traditional topic models with word embeddings. In particular, it models each word with a categorical distribution whose natural parameter is the inner product between a word embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation (LDA), in terms of both topic quality and predictive performance.

    点赞 0
    阅读0+

    Due to the manifold ranking method has a significant effect on the ranking of unknown data based on known data by using a weighted network, many researchers use the manifold ranking method to solve the document summarization task. However, their models only consider the original features but ignore the semantic features of sentences when they construct the weighted networks for the manifold ranking method. To solve this problem, we proposed two improved models based on the manifold ranking method. One is combining the topic model and manifold ranking method (JTMMR) to solve the document summarization task. This model not only uses the original feature, but also uses the semantic feature to represent the document, which can improve the accuracy of the manifold ranking method. The other one is combining the lifelong topic model and manifold ranking method (JLTMMR). On the basis of the JTMMR, this model adds the constraint of knowledge to improve the quality of the topic. At the same time, we also add the constraint of the relationship between documents to dig out a better document semantic features. The JTMMR model can improve the effect of the manifold ranking method by using the better semantic feature. Experiments show that our models can achieve a better result than other baseline models for multi-document summarization task. At the same time, our models also have a good performance on the single document summarization task. After combining with a few basic surface features, our model significantly outperforms some model based on deep learning in recent years. After that, we also do an exploring work for lifelong machine learning by analyzing the effect of adding feedback. Experiments show that the effect of adding feedback to our model is significant.

    点赞 0
    阅读1+

    Objectives To test the feasibility of using Twitter data to assess determinants of consumers' health behavior towards Human papillomavirus (HPV) vaccination informed by the Integrated Behavior Model (IBM). Methods We used three Twitter datasets spanning from 2014 to 2018. We preprocessed and geocoded the tweets, and then built a rule-based model that classified each tweet into either promotional information or consumers' discussions. We applied topic modeling to discover major themes, and subsequently explored the associations between the topics learned from consumers' discussions and the responses of HPV-related questions in the Health Information National Trends Survey (HINTS). Results We collected 2,846,495 tweets and analyzed 335,681 geocoded tweets. Through topic modeling, we identified 122 high-quality topics. The most discussed consumer topic is "cervical cancer screening"; while in promotional tweets, the most popular topic is to increase awareness of "HPV causes cancer". 87 out of the 122 topics are correlated between promotional information and consumers' discussions. Guided by IBM, we examined the alignment between our Twitter findings and the results obtained from HINTS. 35 topics can be mapped to HINTS questions by keywords, 112 topics can be mapped to IBM constructs, and 45 topics have statistically significant correlations with HINTS responses in terms of geographic distributions. Conclusion Not only mining Twitter to assess consumers' health behaviors can obtain results comparable to surveys but can yield additional insights via a theory-driven approach. Limitations exist, nevertheless, these encouraging results impel us to develop innovative ways of leveraging social media in the changing health communication landscape.

    点赞 0
    阅读0+

    In the probabilistic topic models, the quantity of interest---a low-rank matrix consisting of topic vectors---is hidden in the text corpus matrix, masked by noise, and Singular Value Decomposition (SVD) is a potentially useful tool for learning such a matrix. However, different rows and columns of the matrix are usually in very different scales and the connection between this matrix and the singular vectors of the text corpus matrix are usually complicated and hard to spell out, so how to use SVD for learning topic models faces challenges. We overcome the challenges by introducing a proper Pre-SVD normalization of the text corpus matrix and a proper column-wise scaling for the matrix of interest, and by revealing a surprising Post-SVD low-dimensional {\it simplex} structure. The simplex structure, together with the Pre-SVD normalization and column-wise scaling, allows us to conveniently reconstruct the matrix of interest, and motivates a new SVD-based approach to learning topic models. We show that under the popular probabilistic topic model \citep{hofmann1999}, our method has a faster rate of convergence than existing methods in a wide variety of cases. In particular, for cases where documents are long or $n$ is much larger than $p$, our method achieves the optimal rate. At the heart of the proofs is a tight element-wise bound on singular vectors of a multinomially distributed data matrix, which do not exist in literature and we have to derive by ourself. We have applied our method to two data sets, Associated Process (AP) and Statistics Literature Abstract (SLA), with encouraging results. In particular, there is a clear simplex structure associated with the SVD of the data matrices, which largely validates our discovery.

    点赞 0
    阅读0+
子主题
Top