Heterogeneous, mixed-type datasets that include both continuous and categorical variables are ubiquitous, and they enrich data analysis by allowing more complex relationships and interactions to be modelled. Mixture models offer a flexible framework for capturing the underlying heterogeneity and relationships in mixed-type datasets. Most current approaches for modelling mixed data either forgo uncertainty quantification and conduct only point estimation, or use MCMC, which incurs a very high computational cost and does not scale to large datasets. This paper develops a coordinate ascent variational inference (CAVI) algorithm for mixture models on mixed (continuous and categorical) data, which circumvents the high computational cost of MCMC while retaining uncertainty quantification. We demonstrate our approach through simulation studies as well as an applied case study of the NHANES risk factor dataset. We provide theoretical justification for our method, showing that as the sample size $n$ tends to infinity, the variational posterior mean converges locally to the true data-generating parameter value, and that it converges locally to the maximum likelihood estimator at the rate of $O(1/n)$.
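To make the idea of coordinate ascent variational inference on mixed data concrete, the following is a minimal Python sketch. It is not the paper's algorithm; it assumes a deliberately simplified model chosen only for illustration: each continuous variable is Gaussian with known unit variance and a component-specific mean under a $N(0, s_0^2)$ prior, each categorical variable has component-specific probabilities under a symmetric Dirichlet prior, all categorical variables share the same number of levels, and variables are independent within a component. The names `cavi_mixed_mixture` and `digamma`, and the hyperparameters `alpha0`, `beta0` and `s0`, are hypothetical.

```python
import numpy as np


def digamma(x):
    """Approximate digamma: shift the argument up by 6 using
    psi(x) = psi(x + 1) - 1/x, then apply the asymptotic series."""
    x = np.asarray(x, dtype=float)
    shift = sum(1.0 / (x + k) for k in range(6))
    y = x + 6.0
    series = (np.log(y) - 1 / (2 * y) - 1 / (12 * y**2)
              + 1 / (120 * y**4) - 1 / (252 * y**6))
    return series - shift


def cavi_mixed_mixture(x_cont, x_cat, n_levels, K, n_iter=100,
                       alpha0=1.0, beta0=1.0, s0=10.0, seed=0):
    """CAVI for an illustrative (assumed) mixture on mixed data.

    x_cont: (n, p) float array; x_cat: (n, q) int array with values
    in {0, ..., n_levels - 1}. Returns variational parameters.
    """
    rng = np.random.default_rng(seed)
    n = x_cont.shape[0]
    onehot = np.eye(n_levels)[x_cat]            # (n, q, n_levels)
    r = rng.dirichlet(np.ones(K), size=n)       # initial responsibilities (n, K)

    for _ in range(n_iter):
        # Update q(pi) = Dirichlet(alpha)
        Nk = r.sum(axis=0)                      # expected component sizes (K,)
        alpha = alpha0 + Nk
        # Update q(mu_kj) = Normal(m_kj, v_k) for unit-variance Gaussians
        v = 1.0 / (1.0 / s0**2 + Nk)            # (K,)
        m = v[:, None] * (r.T @ x_cont)         # (K, p)
        # Update q(theta_kj) = Dirichlet(b_kj)
        b = beta0 + np.einsum("ik,ijl->kjl", r, onehot)  # (K, q, n_levels)

        # Expected log-parameters under the current variational factors
        e_log_pi = digamma(alpha) - digamma(alpha.sum())
        e_log_theta = digamma(b) - digamma(b.sum(axis=2, keepdims=True))

        # Update q(z_i): terms constant across components are dropped
        sq = (x_cont[:, None, :] - m[None, :, :]) ** 2 + v[None, :, None]
        log_r = (e_log_pi[None, :]
                 - 0.5 * sq.sum(axis=2)
                 + np.einsum("ijl,kjl->ik", onehot, e_log_theta))
        log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

    return {"r": r, "alpha": alpha, "m": m, "v": v, "b": b}
```

Each pass updates one variational factor while holding the others fixed, which is the defining feature of coordinate ascent. Unlike a point estimate, the returned `v` and `b` describe posterior spread for the component means and category probabilities. A fuller implementation would also monitor the evidence lower bound for convergence and handle unknown variances.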