Finding a set of nested partitions of a dataset is useful for uncovering relevant structure at different scales, and is often addressed with data-dependent methodologies. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Taking the integrated classification likelihood (ICL) criterion as the objective function, this work applies to any discrete latent variable model (DLVM) for which this quantity is tractable. The first step maximizes the criterion with respect to the partition. To address the known problem of sub-optimal local maxima found by greedy hill-climbing heuristics, we introduce a new hybrid algorithm, based on a genetic algorithm, that efficiently explores the space of solutions. The resulting algorithm carefully combines and merges different solutions, and allows joint inference of the number $K$ of clusters and of the clusters themselves. Starting from this natural partition, the second step uses a bottom-up greedy procedure to extract a hierarchy of clusters. In a Bayesian context, this is achieved by treating the parameter $\alpha$ of the Dirichlet prior on cluster proportions as a regularization term controlling the granularity of the clustering. A new approximation of the criterion is derived as a log-linear function of $\alpha$, which gives the merge decision criterion a simple functional form. This second step allows the clustering to be explored at coarser scales. The proposed approach is compared with existing strategies on simulated and real data, and is shown to give relevant results. A reference implementation of this work is available in the R package greed accompanying the paper.
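To illustrate why a log-linear dependence on $\alpha$ is plausible, the following minimal sketch looks only at the partition prior term that a symmetric Dirichlet($\alpha$) prior on cluster proportions contributes to the criterion. It is an illustration under stated assumptions, not the derivation used in the paper or in greed: the function names are hypothetical, the model-specific likelihood term is ignored, and the approximation relies on $\Gamma(x) \approx 1/x$ for small $x$ with all clusters non-empty.

```python
from math import lgamma, log


def log_prior_exact(counts, alpha):
    """Exact log p(Z | alpha) for cluster sizes `counts` under a
    symmetric Dirichlet(alpha) prior on the cluster proportions."""
    K = len(counts)
    n = sum(counts)
    return (
        lgamma(K * alpha)
        - K * lgamma(alpha)
        + sum(lgamma(c + alpha) for c in counts)
        - lgamma(n + K * alpha)
    )


def log_prior_small_alpha(counts, alpha):
    """Small-alpha approximation (assumes every count >= 1):
    an alpha-free intercept plus (K - 1) * log(alpha)."""
    K = len(counts)
    n = sum(counts)
    # Gamma(x) ~ 1/x near zero gives lgamma(K*alpha) ~ -log(K) - log(alpha)
    # and -K * lgamma(alpha) ~ K * log(alpha).
    intercept = sum(lgamma(c) for c in counts) - lgamma(n) - log(K)
    return intercept + (K - 1) * log(alpha)


if __name__ == "__main__":
    sizes = [30, 50, 20]  # hypothetical cluster sizes
    for a in (1e-2, 1e-4, 1e-6):
        print(a, log_prior_exact(sizes, a), log_prior_small_alpha(sizes, a))
```

In this approximation the slope in $\log \alpha$ is $K - 1$, so merging two clusters lowers the slope by one. Under these assumptions, the comparison between a partition and its merged version changes sign at a single value of $\log \alpha$, which is what makes a simple merge rule possible.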