主题模型,顾名思义,就是对文字中隐含主题的一种建模方法。“苹果”这个词的背后既包含是苹果公司这样一个主题,也包括了水果的主题。   在这里,我们先定义一下主题究竟是什么。主题就是一个概念、一个方面。它表现为一系列相关的词语。比如一个文章如果涉及到“百度”这个主题,那么“中文搜索”、“李彦宏”等词语就会以较高的频率出现,而如果涉及到“IBM”这个主题,那么“笔记本”等就会出现的很频繁。如果用数学来描述一下的话,主题就是词汇表上词语的条件概率分布 。与主题关系越密切的词语,它的条件概率越大,反之则越小。

    We introduce a dynamic generative model, Bayesian allocation model (BAM), which establishes explicit connections between nonnegative tensor factorization (NTF), graphical models of discrete probability distributions and their Bayesian extensions, and the topic models such as the latent Dirichlet allocation. BAM is based on a Poisson process, whose events are marked by using a Bayesian network, where the conditional probability tables of this network are then integrated out analytically. We show that the resulting marginal process turns out to be a Polya urn, an integer valued self-reinforcing process. This urn processes, which we name a Polya-Bayes process, obey certain conditional independence properties that provide further insight about the nature of NTF. These insights also let us develop space efficient simulation algorithms that respect the potential sparsity of data: we propose a class of sequential importance sampling algorithms for computing NTF and approximating their marginal likelihood, which would be useful for model selection. The resulting methods can also be viewed as a model scoring method for topic models and discrete Bayesian networks with hidden variables. The new algorithms have favourable properties in the sparse data regime when contrasted with variational algorithms that become more accurate when the total sum of the elements of the observed tensor goes to infinity. We illustrate the performance on several examples and numerically study the behaviour of the algorithms for various data regimes.

    点赞 0
    阅读1+

    Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.

    点赞 0
    阅读0+

    The main goal of this paper is to explore latent topic analysis (LTA), in the context of quantum information retrieval. LTA is a valuable technique for document analysis and representation, which has been extensively used in information retrieval and machine learning. Different LTA techniques have been proposed, some based on geometrical modeling (such as latent semantic analysis, LSA) and others based on a strong statistical foundation. However, these two different approaches are not usually mixed. Quantum information retrieval has the remarkable virtue of combining both geometry and probability in a common principled framework. We built on this quantum framework to propose a new LTA method, which has a clear geometrical motivation but also supports a well-founded probabilistic interpretation. An initial exploratory experimentation was performed on three standard data sets. The results show that the proposed method outperforms LSA on two of the three datasets. These results suggests that the quantum-motivated representation is an alternative for geometrical latent topic modeling worthy of further exploration.

    点赞 0
    阅读1+

    In recent years, we have been faced with a series of natural disasters causing a tremendous amount of financial, environmental, and human losses. The unpredictable nature of natural disasters' behavior makes it hard to have a comprehensive situational awareness (SA) to support disaster management. Using opinion surveys is a traditional approach to analyze public concerns during natural disasters; however, this approach is limited, expensive, and time-consuming. Luckily the advent of social media has provided scholars with an alternative means of analyzing public concerns. Social media enable users (people) to freely communicate their opinions and disperse information regarding current events including natural disasters. This research emphasizes the value of social media analysis and proposes an analytical framework: Twitter Situational Awareness (TwiSA). This framework uses text mining methods including sentiment analysis and topic modeling to create a better SA for disaster preparedness, response, and recovery. TwiSA has also effectively deployed on a large number of tweets and tracks the negative concerns of people during the 2015 South Carolina flood.

    点赞 0
    阅读0+

    Scientific documents rely on both mathematics and text to communicate ideas. Inspired by the topical correspondence between mathematical equations and word contexts observed in scientific texts, we propose a novel topic model that jointly generates mathematical equations and their surrounding text (TopicEq). Using an extension of the correlated topic model, the context is generated from a mixture of latent topics, and the equation is generated by an RNN that depends on the latent topic activations. To experiment with this model, we create a corpus of 400K equation-context pairs extracted from a range of scientific articles from arXiv, and fit the model using a variational autoencoder approach. Experimental results show that this joint model significantly outperforms existing topic models and equation models for scientific texts. Moreover, we qualitatively show that the model effectively captures the relationship between topics and mathematics, enabling novel applications such as topic-aware equation generation, equation topic inference, and topic-aware alignment of mathematical symbols and words.

    点赞 0
    阅读1+

    Characterizing the exact asymptotic distributions of high-dimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. To this end, in this paper we introduce a general framework of asymptotic theory of eigenvectors (ATE) for large structured symmetric random matrices with heterogeneous variances, and establish the asymptotic properties of the spiked eigenvectors and eigenvalues for the scenario of the generalized Wigner matrix noise, where the mean matrix is assumed to have the low-rank structure. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we establish novel asymptotic expansions for the general linear combination and further show that it is asymptotically normal after some normalization, where the weight vector can be arbitrary. We also provide a more general asymptotic theory for the spiked eigenvectors using the bilinear form. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications.

    点赞 0
    阅读0+

    Aspect-based Opinion Summary (AOS), consisting of aspect discovery and sentiment classification steps, has recently been emerging as one of the most crucial data mining tasks in e-commerce systems. Along this direction, the LDA-based model is considered as a notably suitable approach, since this model offers both topic modeling and sentiment classification. However, unlike traditional topic modeling, in the context of aspect discovery it is often required some initial seed words, whose prior knowledge is not easy to be incorporated into LDA models. Moreover, LDA approaches rely on sampling methods, which need to load the whole corpus into memory, making them hardly scalable. In this research, we study an alternative approach for AOS problem, based on Autoencoding Variational Inference (AVI). Firstly, we introduce the Autoencoding Variational Inference for Aspect Discovery (AVIAD) model, which extends the previous work of Autoencoding Variational Inference for Topic Models (AVITM) to embed prior knowledge of seed words. This work includes enhancement of the previous AVI architecture and also modification of the loss function. Ultimately, we present the Autoencoding Variational Inference for Joint Sentiment/Topic (AVIJST) model. In this model, we substantially extend the AVI model to support the JST model, which performs topic modeling for corresponding sentiment. The experimental results show that our proposed models enjoy higher topic coherent, faster convergence time and better accuracy on sentiment classification, as compared to their LDA-based counterparts.

    点赞 0
    阅读0+

    Aspect-based Opinion Summary (AOS), consisting of aspect discovery and sentiment classification steps, has recently been emerging as one of the most crucial data mining tasks in e-commerce systems. Along this direction, the LDA-based model is considered as a notably suitable approach, since this model offers both topic modeling and sentiment classification. However, unlike traditional topic modeling, in the context of aspect discovery it is often required some initial seed words, whose prior knowledge is not easy to be incorporated into LDA models. Moreover, LDA approaches rely on sampling methods, which need to load the whole corpus into memory, making them hardly scalable. In this research, we study an alternative approach for AOS problem, based on Autoencoding Variational Inference (AVI). Firstly, we introduce the Autoencoding Variational Inference for Aspect Discovery (AVIAD) model, which extends the previous work of Autoencoding Variational Inference for Topic Models (AVITM) to embed prior knowledge of seed words. This work includes enhancement of the previous AVI architecture and also modification of the loss function. Ultimately, we present the Autoencoding Variational Inference for Joint Sentiment/Topic (AVIJST) model. In this model, we substantially extend the AVI model to support the JST model, which performs topic modeling for corresponding sentiment. The experimental results show that our proposed models enjoy higher topic coherent, faster convergence time and better accuracy on sentiment classification, as compared to their LDA-based counterparts.

    点赞 0
    阅读0+

    Massive volumes of data continuously generated on social platforms have become an important information source for users. A primary method to obtain fresh and valuable information from social streams is \emph{social search}. Although there have been extensive studies on social search, existing methods only focus on the \emph{relevance} of query results but ignore the \emph{representativeness}. In this paper, we propose a novel Semantic and Influence aware $k$-Representative ($k$-SIR) query for social streams based on topic modeling. Specifically, we consider that both user queries and elements are represented as vectors in the topic space. A $k$-SIR query retrieves a set of $k$ elements with the maximum \emph{representativeness} over the sliding window at query time w.r.t. the query vector. The representativeness of an element set comprises both semantic and influence scores computed by the topic model. Subsequently, we design two approximation algorithms, namely \textsc{Multi-Topic ThresholdStream} (MTTS) and \textsc{Multi-Topic ThresholdDescend} (MTTD), to process $k$-SIR queries in real-time. Both algorithms leverage the ranked lists maintained on each topic for $k$-SIR processing with theoretical guarantees. Extensive experiments on real-world datasets demonstrate the effectiveness of $k$-SIR query compared with existing methods as well as the efficiency and scalability of our proposed algorithms for $k$-SIR processing.

    点赞 0
    阅读0+

    In the era of big science, countries allocate big research and development budgets to large scientific facilities that boost collaboration and research capability. A nuclear fusion device called the "tokamak" is a source of great interest for many countries because it ideally generates sustainable energy expected to solve the energy crisis in the future. Here, to explore the scientific effects of tokamaks, we map a country's research capability in nuclear fusion research with normalized revealed comparative advantage on five topical clusters -- material, plasma, device, diagnostics, and simulation -- detected through a dynamic topic model. Our approach captures not only the growth of China, India, and the Republic of Korea but also the decline of Canada, Japan, Sweden, and the Netherlands. Time points of their rise and fall are related to tokamak operation, highlighting the importance of large facilities in big science. The gravity model points out that two countries collaborate less in device, diagnostics, and plasma research if they have comparative advantages in different topics. This relation is a unique feature of nuclear fusion compared to other science fields. Our results can be used and extended when building national policies for big science.

    点赞 0
    阅读0+

    Privacy is a major issue in learning from distributed data. Recently the cryptographic literature has provided several tools for this task. However, these tools either reduce the quality/accuracy of the learning algorithm---e.g., by adding noise---or they incur a high performance penalty and/or involve trusting external authorities. We propose a methodology for {\sl private distributed machine learning from light-weight cryptography} (in short, PD-ML-Lite). We apply our methodology to two major ML algorithms, namely non-negative matrix factorization (NMF) and singular value decomposition (SVD). Our resulting protocols are communication optimal, achieve the same accuracy as their non-private counterparts, and satisfy a notion of privacy---which we define---that is both intuitive and measurable. Our approach is to use lightweight cryptographic protocols (secure sum and normalized secure sum) to build learning algorithms rather than wrap complex learning algorithms in a heavy-cost MPC framework. We showcase our algorithms' utility and privacy on several applications: for NMF we consider topic modeling and recommender systems, and for SVD, principal component regression, and low rank approximation.

    点赞 0
    阅读0+

    We address two challenges in topic models: (1) Context information around words helps in determining their actual meaning, e.g., "networks" used in the contexts "artificial neural networks" vs. "biological neuron networks". Generative topic models infer topic-word distributions, taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. The proposed model is named as iDocNADE. (2) Due to the small number of word occurrences (i.e., lack of context) in short text and data sparsity in a corpus of few documents, the application of topic models is challenging on such texts. Therefore, we propose a simple and efficient way of incorporating external knowledge into neural autoregressive topic models: we use embeddings as a distributional prior. The proposed variants are named as DocNADEe and iDocNADEe. We present novel neural autoregressive topic model variants that consistently outperform state-of-the-art generative topic models in terms of generalization, interpretability (topic coherence) and applicability (retrieval and classification) over 7 long-text and 8 short-text datasets from diverse domains.

    点赞 0
    阅读0+

    In this paper, we derive the asymptotic behavior of the Bayesian generalization error in the topic model. By theoretical analysis of the maximum pole of the zeta function (real log canonical threshold) of the topic model, we obtain an upper bound of the Bayesian generalization error and the free energy in the topic model and the stochastic matrix factorization (SMF; it can be regarded as a restriction of the non-negative matrix factorization). The results show that the generalization error in the topic model and SMF becomes smaller than regular statistical models if Bayesian inference is attained.

    点赞 0
    阅读0+

    This paper proposes Dirichlet Variational Autoencoder (DirVAE) using a Dirichlet prior for a continuous latent variable that exhibits the characteristic of the categorical probabilities. To infer the parameters of DirVAE, we utilize the stochastic gradient method by approximating the Gamma distribution, which is a component of the Dirichlet distribution, with the inverse Gamma CDF approximation. Additionally, we reshape the component collapsing issue by investigating two problem sources, which are decoder weight collapsing and latent value collapsing, and we show that DirVAE has no component collapsing; while Gaussian VAE exhibits the decoder weight collapsing and Stick-Breaking VAE shows the latent value collapsing. The experimental results show that 1) DirVAE models the latent representation result with the best log-likelihood compared to the baselines; and 2) DirVAE produces more interpretable latent values with no collapsing issues which the baseline models suffer from. Also, we show that the learned latent representation from the DirVAE achieves the best classification accuracy in the semi-supervised and the supervised classification tasks on MNIST, OMNIGLOT, and SVHN compared to the baseline VAEs. Finally, we demonstrated that the DirVAE augmented topic models show better performances in most cases.

    点赞 0
    阅读0+

    We present Variational Aspect-based Latent Topic Allocation (VALTA), a family of autoencoding topic models that learn aspect-based representations of reviews. VALTA defines a user-item encoder that maps bag-of-words vectors for combined reviews associated with each paired user and item onto structured embeddings, which in turn define per-aspect topic weights. We model individual reviews in a structured manner by inferring an aspect assignment for each sentence in a given review, where the per-aspect topic weights obtained by the user-item encoder serve to define a mixture over topics, conditioned on the aspect. The result is an autoencoding neural topic model for reviews, which can be trained in a fully unsupervised manner to learn topics that are structured into aspects. Experimental evaluation on large number of datasets demonstrates that aspects are interpretable, yield higher coherence scores than non-structured autoencoding topic model variants, and can be utilized to perform aspect-based comparison and genre discovery.

    点赞 0
    阅读0+
子主题
Top