We introduce a novel approach for estimating Latent Dirichlet Allocation (LDA) parameters from collapsed Gibbs samples (CGS), by leveraging the full conditional distributions over the latent variable assignments to efficiently average over multiple samples, for little more computational cost than drawing a single additional collapsed Gibbs sample. Our approach can be understood as adapting the soft clustering methodology of Collapsed Variational Bayes (CVB0) to CGS parameter estimation, in order to get the best of both techniques. Our estimators can straightforwardly be applied to the output of any existing implementation of CGS, including modern accelerated variants. We perform extensive empirical comparisons of our estimators with those of standard collapsed inference algorithms on real-world data for both unsupervised LDA and Prior-LDA, a supervised variant of LDA for multi-label classification. Our results show a consistent advantage of our approach over traditional CGS under all experimental conditions, and over CVB0 inference in the majority of conditions. More broadly, our results highlight the importance of averaging over multiple samples in LDA parameter estimation, and the use of efficient computational techniques to do so.
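The core idea, replacing each token's hard topic assignment with its full conditional distribution and averaging the resulting soft counts over samples, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`soft_counts`, `estimate_from_samples`), the flat token arrays `docs` and `words`, the symmetric hyperparameters `alpha` and `beta`, and the smoothed normalization used for the final estimates are all assumptions made for this example. The conditional is the standard collapsed Gibbs update for LDA.

```python
import numpy as np

def soft_counts(docs, words, z, K, V, D, alpha, beta):
    """Soft document-topic and word-topic counts for one CGS sample (illustrative)."""
    docs, words, z = np.asarray(docs), np.asarray(words), np.asarray(z)
    n_tokens = len(z)

    # Hard counts implied by the sampled assignments z.
    n_dk = np.zeros((D, K))
    n_wk = np.zeros((V, K))
    np.add.at(n_dk, (docs, z), 1.0)
    np.add.at(n_wk, (words, z), 1.0)
    n_k = n_wk.sum(axis=0)

    # One-hot matrix used to exclude each token's own assignment from the counts.
    own = np.zeros((n_tokens, K))
    own[np.arange(n_tokens), z] = 1.0

    # Standard collapsed Gibbs full conditional p(z_i = k | z_-i, w) for every token.
    p = (n_dk[docs] - own + alpha) * (n_wk[words] - own + beta) / (n_k - own + V * beta)
    p /= p.sum(axis=1, keepdims=True)

    # Accumulate the probabilities instead of the hard assignments.
    soft_dk = np.zeros((D, K))
    soft_wk = np.zeros((V, K))
    np.add.at(soft_dk, docs, p)
    np.add.at(soft_wk, words, p)
    return soft_dk, soft_wk

def estimate_from_samples(docs, words, z_samples, K, V, D, alpha=0.1, beta=0.01):
    """Average soft counts over several samples, then normalize (assumed estimator form)."""
    soft_dk = np.zeros((D, K))
    soft_wk = np.zeros((V, K))
    for z in z_samples:
        dk, wk = soft_counts(docs, words, z, K, V, D, alpha, beta)
        soft_dk += dk / len(z_samples)
        soft_wk += wk / len(z_samples)
    theta = (soft_dk + alpha) / (soft_dk.sum(axis=1, keepdims=True) + K * alpha)
    phi = (soft_wk + beta) / (soft_wk.sum(axis=0, keepdims=True) + V * beta)
    return theta, phi

# Toy corpus: 2 documents, 4 word types, 2 topics, 2 hypothetical CGS samples.
docs = [0, 0, 0, 1, 1, 1]
words = [0, 1, 2, 2, 3, 3]
z_samples = [[0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0]]
theta, phi = estimate_from_samples(docs, words, z_samples, K=2, V=4, D=2)
```

Because the sketch only needs the sampled assignments `z`, it can be run as a post-processing step on the output of any collapsed Gibbs sampler. The per-sample work is one pass over the tokens, which is comparable to the cost of one more Gibbs sweep.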