We introduce an improved unsupervised clustering protocol especially suited to large-scale structured data. The protocol follows three steps: a dimensionality reduction of the data, a density estimation over the low-dimensional representation of the data, and a final segmentation of the density landscape. For the dimensionality reduction step, we introduce a parallelized implementation of the well-known t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm that significantly alleviates some of its inherent limitations while improving its suitability for large datasets. We also introduce a new adaptive Kernel Density Estimation method, specifically coupled with the t-SNE framework to obtain accurate density estimates from the embedded data, and a variant of the rainfalling watershed algorithm to identify clusters within the density landscape. The whole mapping protocol is wrapped in the bigMap R package, together with visualization and analysis tools that ease the qualitative and quantitative assessment of the clustering.
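To make the second and third steps of the protocol concrete, the following is a minimal sketch in Python, not the bigMap implementation and not its API. It assumes a 2D embedding has already been computed (the t-SNE step is not shown), and it substitutes a simple k-nearest-neighbour bandwidth for the t-SNE-coupled adaptive bandwidth described above. All function names and parameters (`sunflower_blob`, `adaptive_kde_grid`, `rainfall_watershed`, `assign_points`, `grid_size`, `k`) are hypothetical and chosen for illustration only.

```python
import math

def sunflower_blob(cx, cy, n=25, scale=0.1):
    """Hypothetical helper: n deterministic points spread around (cx, cy)."""
    return [(cx + scale * math.sqrt(i) * math.cos(2.4 * i),
             cy + scale * math.sqrt(i) * math.sin(2.4 * i)) for i in range(n)]

def adaptive_kde_grid(points, grid_size=40, k=5):
    """Gaussian KDE on a regular grid with one bandwidth per point.

    Assumption: the bandwidth of a point is the distance to its k-th
    nearest neighbour, a stand-in for the t-SNE-coupled bandwidth.
    """
    n = len(points)
    bandwidths = []
    for px, py in points:
        dists = sorted(math.hypot(px - qx, py - qy) for qx, qy in points)
        # dists[0] is the point itself, so dists[k] is the k-th neighbour.
        bandwidths.append(max(dists[min(k, n - 1)], 1e-9))
    pad = max(bandwidths)
    x0 = min(p[0] for p in points) - pad
    y0 = min(p[1] for p in points) - pad
    step_x = (max(p[0] for p in points) + pad - x0) / (grid_size - 1)
    step_y = (max(p[1] for p in points) + pad - y0) / (grid_size - 1)
    density = [[0.0] * grid_size for _ in range(grid_size)]
    for r in range(grid_size):
        gy = y0 + r * step_y
        for c in range(grid_size):
            gx = x0 + c * step_x
            total = 0.0
            for (px, py), h in zip(points, bandwidths):
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                total += math.exp(-d2 / (2 * h * h)) / (2 * math.pi * h * h)
            density[r][c] = total / n
    return density, (x0, y0, step_x, step_y)

def rainfall_watershed(density):
    """Label each grid cell with the density peak reached by steepest ascent."""
    rows, cols = len(density), len(density[0])
    labels = [[-1] * cols for _ in range(rows)]
    peaks = []

    def uphill(r, c):
        # Highest of the 8 neighbours; ties are broken by cell index so that
        # every move is strictly increasing and the walk always terminates.
        best = (density[r][c], r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    best = max(best, (density[rr][cc], rr, cc))
        return best[1], best[2]

    for r in range(rows):
        for c in range(cols):
            path, cur = [], (r, c)
            while labels[cur[0]][cur[1]] == -1:
                path.append(cur)
                nxt = uphill(*cur)
                if nxt == cur:  # local maximum: start a new cluster
                    labels[cur[0]][cur[1]] = len(peaks)
                    peaks.append(cur)
                    break
                cur = nxt
            label = labels[cur[0]][cur[1]]
            for pr, pc in path:  # every cell on the path drains to this peak
                labels[pr][pc] = label
    return labels, peaks

def assign_points(points, labels, grid):
    """Give each embedded point the label of its nearest grid cell."""
    x0, y0, step_x, step_y = grid
    last = len(labels) - 1
    return [labels[min(max(round((py - y0) / step_y), 0), last)]
                  [min(max(round((px - x0) / step_x), 0), last)]
            for px, py in points]

if __name__ == "__main__":
    embedding = sunflower_blob(0.0, 0.0) + sunflower_blob(6.0, 0.0)
    density, grid = adaptive_kde_grid(embedding)
    labels, peaks = rainfall_watershed(density)
    print("clusters found:", len(peaks))
    print("point labels:", assign_points(embedding, labels, grid))
```

The grid-based evaluation keeps the segmentation cost independent of the number of data points once the density raster is built, which is one plausible reason to separate density estimation from segmentation. A small `k` gives a bumpier landscape and therefore more peaks, so the number of clusters in this sketch depends on that assumed parameter.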