Markov state modeling has gained popularity in various scientific fields because it reduces complex time-series data sets to transitions between a few states. Yet common Markov state modeling frameworks assume that a single Markov chain describes the data, so they cannot discern heterogeneities. As an alternative, this paper models time-series data using a mixture of Markov chains and automatically determines the number of mixture components with the variational expectation-maximization (EM) algorithm. Variational EM simultaneously identifies the number of Markov chains and the dynamics of each chain, without expensive model comparisons or posterior sampling. As a theoretical contribution, the paper identifies the natural limits of Markov state mixture modeling by proving a lower bound on the classification error. It then presents numerical experiments in which variational EM achieves performance consistent with the theoretically optimal error scaling. The experiments use synthetic and observational data sets, including Last.fm music listening, ultramarathon running, and gene expression. In each of the three data sets, variational EM identifies meaningful heterogeneities.
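To make the mixture-of-Markov-chains idea concrete, here is a minimal sketch of plain EM for such a mixture. It is an illustration, not the paper's method: the paper uses *variational* EM, which additionally infers the number of components, whereas this sketch fixes the number of chains in advance. All function names and parameters below are hypothetical.

```python
import numpy as np

def em_markov_mixture(trajs, n_states, n_chains, n_iter=50, seed=0):
    """Plain EM for a mixture of Markov chains (illustrative sketch only;
    unlike variational EM, the number of chains is fixed, not inferred)."""
    rng = np.random.default_rng(seed)
    # Random init: transition matrices T[k] (rows sum to 1) and weights pi
    T = rng.random((n_chains, n_states, n_states))
    T /= T.sum(axis=2, keepdims=True)
    pi = np.full(n_chains, 1.0 / n_chains)
    # Sufficient statistics: transition counts C[i] for each trajectory
    C = np.zeros((len(trajs), n_states, n_states))
    for i, x in enumerate(trajs):
        for a, b in zip(x[:-1], x[1:]):
            C[i, a, b] += 1
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] ∝ pi[k] * prod_ab T[k,a,b]^C[i,a,b]
        logp = np.einsum('iab,kab->ik', C, np.log(T)) + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)  # stabilize before exp
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights and transition rows
        pi = r.mean(axis=0)
        T = np.einsum('ik,iab->kab', r, C) + 1e-6  # tiny prior avoids zeros
        T /= T.sum(axis=2, keepdims=True)
    return pi, T, r

def simulate(T, length, rng):
    """Sample one trajectory from a single Markov chain with matrix T."""
    x = [int(rng.integers(len(T)))]
    for _ in range(length - 1):
        x.append(int(rng.choice(len(T), p=T[x[-1]])))
    return x
```

On synthetic data drawn from two well-separated chains (one persistent, one rapidly switching), the responsibilities `r` cluster the trajectories by their generating chain, which is the heterogeneity-detection task the abstract describes.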