** Characterizing the exact asymptotic distributions of high-dimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. To this end, in this paper we introduce a general framework of asymptotic theory of eigenvectors (ATE) for large structured symmetric random matrices with heterogeneous variances, and establish the asymptotic properties of the spiked eigenvectors and eigenvalues for the scenario of generalized Wigner matrix noise, where the mean matrix is assumed to have a low-rank structure. Under some mild regularity conditions, we provide asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we establish novel asymptotic expansions for a general linear combination of their components and further show that it is asymptotically normal after some normalization, where the weight vector can be arbitrary. We also provide a more general asymptotic theory for the spiked eigenvectors using the bilinear form. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many widely used ones, such as stochastic block models with or without overlapping communities for network analysis and topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications. **

** Aspect-based Opinion Summary (AOS), consisting of aspect discovery and sentiment classification steps, has recently emerged as one of the most crucial data mining tasks in e-commerce systems. Along this direction, the LDA-based model is considered a notably suitable approach, since it offers both topic modeling and sentiment classification. However, unlike traditional topic modeling, aspect discovery often requires some initial seed words, and this prior knowledge is not easy to incorporate into LDA models. Moreover, LDA approaches rely on sampling methods, which need to load the whole corpus into memory, making them hard to scale. In this research, we study an alternative approach to the AOS problem, based on Autoencoding Variational Inference (AVI). Firstly, we introduce the Autoencoding Variational Inference for Aspect Discovery (AVIAD) model, which extends the previous work on Autoencoding Variational Inference for Topic Models (AVITM) to embed prior knowledge of seed words. This work includes an enhancement of the previous AVI architecture and a modification of the loss function. Finally, we present the Autoencoding Variational Inference for Joint Sentiment/Topic (AVIJST) model. In this model, we substantially extend the AVI model to support the JST model, which performs topic modeling for the corresponding sentiments. The experimental results show that our proposed models achieve higher topic coherence, faster convergence and better accuracy on sentiment classification than their LDA-based counterparts. **

** Massive volumes of data continuously generated on social platforms have become an important information source for users. A primary method to obtain fresh and valuable information from social streams is \emph{social search}. Although there have been extensive studies on social search, existing methods only focus on the \emph{relevance} of query results but ignore the \emph{representativeness}. In this paper, we propose a novel Semantic and Influence aware $k$-Representative ($k$-SIR) query for social streams based on topic modeling. Specifically, we consider that both user queries and elements are represented as vectors in the topic space. A $k$-SIR query retrieves a set of $k$ elements with the maximum \emph{representativeness} over the sliding window at query time w.r.t. the query vector. The representativeness of an element set comprises both semantic and influence scores computed by the topic model. Subsequently, we design two approximation algorithms, namely \textsc{Multi-Topic ThresholdStream} (MTTS) and \textsc{Multi-Topic ThresholdDescend} (MTTD), to process $k$-SIR queries in real-time. Both algorithms leverage the ranked lists maintained on each topic for $k$-SIR processing with theoretical guarantees. Extensive experiments on real-world datasets demonstrate the effectiveness of $k$-SIR query compared with existing methods as well as the efficiency and scalability of our proposed algorithms for $k$-SIR processing. **

** In the era of big science, countries allocate big research and development budgets to large scientific facilities that boost collaboration and research capability. A nuclear fusion device called the "tokamak" is a source of great interest for many countries because it ideally generates sustainable energy expected to solve the energy crisis in the future. Here, to explore the scientific effects of tokamaks, we map a country's research capability in nuclear fusion research with normalized revealed comparative advantage on five topical clusters -- material, plasma, device, diagnostics, and simulation -- detected through a dynamic topic model. Our approach captures not only the growth of China, India, and the Republic of Korea but also the decline of Canada, Japan, Sweden, and the Netherlands. Time points of their rise and fall are related to tokamak operation, highlighting the importance of large facilities in big science. The gravity model points out that two countries collaborate less in device, diagnostics, and plasma research if they have comparative advantages in different topics. This relation is a unique feature of nuclear fusion compared to other science fields. Our results can be used and extended when building national policies for big science. **
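As a rough illustration of the index mentioned in this abstract, the sketch below computes a normalized revealed comparative advantage from a country-by-topic count matrix. It assumes one common additive definition, NRCA_ij = X_ij / X - (X_i. * X_.j) / X^2, which may differ from the exact variant used by the authors; the function name `normalized_rca` and the toy counts are hypothetical.

```python
def normalized_rca(counts):
    """Normalized RCA for a country-by-topic matrix of publication counts.

    Assumed definition (not confirmed by the abstract):
        NRCA[i][j] = X_ij / X - (X_i. * X_.j) / X**2
    where X_i. is the row total, X_.j the column total, and X the grand total.
    Positive values indicate a comparative advantage of country i in topic j.
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    total = sum(row_totals)
    return [
        [
            counts[i][j] / total - row_totals[i] * col_totals[j] / total**2
            for j in range(len(col_totals))
        ]
        for i in range(len(row_totals))
    ]


# Toy example: two countries (rows) and two topics (columns).
scores = normalized_rca([[8, 2], [2, 8]])
```

Under this definition the scores sum to zero over all country-topic pairs, so they can be compared across countries and topics on a common scale.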

** Privacy is a major issue in learning from distributed data. Recently the cryptographic literature has provided several tools for this task. However, these tools either reduce the quality/accuracy of the learning algorithm---e.g., by adding noise---or they incur a high performance penalty and/or involve trusting external authorities. We propose a methodology for {\sl private distributed machine learning from light-weight cryptography} (in short, PD-ML-Lite). We apply our methodology to two major ML algorithms, namely non-negative matrix factorization (NMF) and singular value decomposition (SVD). Our resulting protocols are communication optimal, achieve the same accuracy as their non-private counterparts, and satisfy a notion of privacy---which we define---that is both intuitive and measurable. Our approach is to use lightweight cryptographic protocols (secure sum and normalized secure sum) to build learning algorithms rather than wrap complex learning algorithms in a heavy-cost MPC framework. We showcase our algorithms' utility and privacy on several applications: for NMF we consider topic modeling and recommender systems, and for SVD, principal component regression, and low rank approximation. **
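To make the secure sum building block concrete, here is a minimal sketch of a generic additive secret-sharing secure sum, simulated in a single process. It is not taken from the paper: the abstract does not specify the protocol, and the names `make_shares`, `secure_sum`, and `MODULUS` are hypothetical.

```python
import secrets

# Assumed public modulus; it must exceed any possible true total.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split value into n_parties random shares that sum to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate the protocol: each party only ever sees random-looking shares."""
    n = len(private_values)
    # shares[i][j] is the share that party i sends to party j.
    shares = [make_shares(v, n, modulus) for v in private_values]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(shares[i][j] for i in range(n)) % modulus for j in range(n)]
    # The published partial sums reveal the total but not the individual inputs.
    return sum(partial_sums) % modulus


total = secure_sum([3, 10, 29])
```

In a real deployment each party would run on its own machine and exchange shares over secure channels; the single-process simulation only checks the arithmetic.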

** We address two challenges in topic models: (1) Context information around words helps in determining their actual meaning, e.g., "networks" used in the contexts "artificial neural networks" vs. "biological neuron networks". Generative topic models infer topic-word distributions, taking little or no context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. The proposed model is named iDocNADE. (2) Due to the small number of word occurrences (i.e., lack of context) in short texts and data sparsity in a corpus of few documents, applying topic models to such texts is challenging. Therefore, we propose a simple and efficient way of incorporating external knowledge into neural autoregressive topic models: we use embeddings as a distributional prior. The proposed variants are named DocNADEe and iDocNADEe. We present novel neural autoregressive topic model variants that consistently outperform state-of-the-art generative topic models in terms of generalization, interpretability (topic coherence) and applicability (retrieval and classification) over 7 long-text and 8 short-text datasets from diverse domains. **

** In this paper, we derive the asymptotic behavior of the Bayesian generalization error in the topic model. By theoretically analyzing the maximum pole of the zeta function (the real log canonical threshold) of the topic model, we obtain an upper bound on the Bayesian generalization error and the free energy in the topic model and in stochastic matrix factorization (SMF, which can be regarded as a restriction of non-negative matrix factorization). The results show that the generalization error in the topic model and SMF becomes smaller than that of regular statistical models if Bayesian inference is attained. **
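For context, the link between the real log canonical threshold and the two bounded quantities can be written out using the standard results of singular learning theory. The symbols below ($\lambda$ for the threshold, $n$ for the sample size, $S_n$ for the empirical entropy, $d$ for the parameter dimension) are notation assumed here, not taken from the abstract.

```latex
% Expected Bayesian generalization error and free energy in terms of
% the real log canonical threshold \lambda (assumed notation):
\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
\qquad
F_n = n S_n + \lambda \log n + O_p(\log\log n).
% For regular statistical models \lambda = d/2, so an upper bound
% \lambda \le \bar{\lambda} < d/2 yields a smaller leading term in both quantities.
```

An upper bound on $\lambda$ therefore bounds the leading terms of both the generalization error and the free energy at once.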

** This paper proposes the Dirichlet Variational Autoencoder (DirVAE), which uses a Dirichlet prior for a continuous latent variable that exhibits the characteristics of categorical probabilities. To infer the parameters of DirVAE, we use the stochastic gradient method, approximating the Gamma distribution, which is a component of the Dirichlet distribution, with the inverse Gamma CDF approximation. Additionally, we reframe the component collapsing issue by investigating two problem sources, decoder weight collapsing and latent value collapsing, and we show that DirVAE has no component collapsing, whereas the Gaussian VAE exhibits decoder weight collapsing and the Stick-Breaking VAE shows latent value collapsing. The experimental results show that 1) DirVAE models the latent representation with the best log-likelihood compared to the baselines; and 2) DirVAE produces more interpretable latent values, without the collapsing issues that the baseline models suffer from. Also, we show that the latent representation learned by DirVAE achieves the best classification accuracy among the baseline VAEs in semi-supervised and supervised classification tasks on MNIST, OMNIGLOT, and SVHN. Finally, we demonstrate that DirVAE-augmented topic models perform better in most cases. **
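The inverse Gamma CDF approximation mentioned in this abstract can be sketched as follows. The sketch assumes the usual small-shape approximation of the Gamma CDF, F(x; α, β) ≈ (βx)^α / (αΓ(α)), whose inverse is x ≈ (u·α·Γ(α))^(1/α) / β, and builds a Dirichlet sample by normalizing independent Gamma draws. The function names are hypothetical, and whether this matches the authors' exact formulation is an assumption.

```python
import math
import random


def approx_gamma_sample(alpha, u, beta=1.0):
    """Approximate inverse CDF of Gamma(shape=alpha, rate=beta) at u in (0, 1].

    Uses the small-alpha approximation F(x) ~= (beta * x)**alpha / (alpha * Gamma(alpha)),
    so the result is a smooth function of alpha and gradients can pass through it.
    """
    return (u * alpha * math.gamma(alpha)) ** (1.0 / alpha) / beta


def approx_dirichlet_sample(alphas, rng=random):
    """Draw an approximate Dirichlet(alphas) sample by normalizing Gamma draws."""
    # 1.0 - rng.random() lies in (0, 1], which avoids an exactly-zero uniform draw.
    gammas = [approx_gamma_sample(a, 1.0 - rng.random()) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)
z = approx_dirichlet_sample([0.5, 0.5, 0.5], rng)  # non-negative, sums to 1
```

The approximation is only accurate for small shape parameters. For very small `alpha`, the power `1 / alpha` can underflow in floating point, so a practical implementation would likely work in log space.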