Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing approaches are suited to this task. This study presents the state of the art of these approaches and compares them using various simulation models. The compared methods include the distance-based approaches k-prototypes, PDQ, and convex k-means, and the probabilistic methods KAy-means for MIxed LArge data (KAMILA), mixtures of Bayesian networks (MBNs), and the latent class model (LCM). The aim is to provide insight into the behavior of the different methods across a wide range of scenarios by varying experimental factors such as the number of clusters, cluster overlap, sample size, dimension, proportion of continuous variables in the dataset, and the clusters' distributions. The degree of cluster overlap, the proportion of continuous variables, and the sample size have a significant impact on the observed performance. When strong interactions exist between variables alongside an explicit dependence on cluster membership, none of the evaluated methods performed satisfactorily. In our experiments, KAMILA, LCM, and k-prototypes exhibited the best performance with respect to the adjusted Rand index (ARI). All the methods are available in R.
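For reference, the ARI used as the comparison criterion above can be computed directly from the contingency table of two partitions. The sketch below is a minimal pure-Python illustration of the standard ARI formula, not code from the study (the function name is ours); in practice one would use an existing implementation such as `mclust::adjustedRandIndex` in R or `sklearn.metrics.adjusted_rand_score` in Python.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat partitions of the same items.

    ARI = (Index - Expected) / (MaxIndex - Expected), where Index is the
    number of agreeing pairs counted via the contingency table.
    """
    n = len(labels_true)
    # Pair counts from the contingency table of the two labelings.
    contingency = Counter(zip(labels_true, labels_pred))
    index = sum(comb(c, 2) for c in contingency.values())
    # Marginal pair counts for each partition.
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case (e.g. single-cluster partitions)
        return 1.0
    return (index - expected) / (max_index - expected)

# A perfect recovery up to label permutation scores 1.0:
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

Because the index is corrected for chance, ARI is 0 in expectation for random labelings and can be negative for partitions worse than chance, which makes it a suitable common yardstick across the clustering methods compared here.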