StaTIX - Statistical Type Inference on Linked Data

Large knowledge bases typically contain data adhering to various schemas with incomplete and/or noisy type information. This seriously complicates further integration and post-processing efforts, as type information is crucial in correctly handling the data. In this paper, we introduce a novel statistical type inference method, called StaTIX, to effectively infer instance types in Linked Data sets in a fully unsupervised manner. Our inference technique leverages a new hierarchical clustering algorithm that is robust, highly effective, and scalable. We introduce a novel approach to reduce the processing complexity of the similarity matrix specifying the relations between various instances in the knowledge base. This approach speeds up the inference process while also improving the correctness of the inferred types due to the noise attenuation in the input data. We further optimize the clustering process by introducing a dedicated hash function that speeds up the inference process by orders of magnitude without negatively affecting its accuracy. Finally, we describe a new technique to identify representative clusters from the multi-scale output of our clustering algorithm to further improve the accuracy of the inferred types. We empirically evaluate our approach on several real-world datasets and compare it to the state of the art. Our results show that StaTIX is more efficient than existing methods (both in terms of speed and memory consumption) as well as more effective. StaTIX reduces the F1-score error of the predicted types by about 40% on average compared to the state of the art and improves the execution time by orders of magnitude.
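
To make the "F1-score error" claim concrete, the following minimal sketch works through the arithmetic. It assumes that the F1-score error means 1 − F1, which the abstract does not define explicitly. The function names and the example scores (0.70 and 0.82) are hypothetical, chosen only for illustration; they are not results from the paper.

```python
def f1_error(f1: float) -> float:
    """Error complementary to an F1-score (assumed definition: 1 - F1)."""
    if not 0.0 <= f1 <= 1.0:
        raise ValueError("F1-score must lie in [0, 1]")
    return 1.0 - f1


def relative_error_reduction(baseline_f1: float, improved_f1: float) -> float:
    """Fraction by which the F1 error shrinks from baseline to improved."""
    baseline_error = f1_error(baseline_f1)
    if baseline_error == 0.0:
        raise ValueError("baseline already has zero error")
    return (baseline_error - f1_error(improved_f1)) / baseline_error


# Hypothetical scores, chosen only to illustrate the arithmetic:
# the error falls from 0.30 to 0.18, a relative reduction of about 40%.
reduction = relative_error_reduction(baseline_f1=0.70, improved_f1=0.82)
print(f"{reduction:.0%}")
```

Under this reading, a 40% error reduction is a relative improvement: the absolute gain in F1 depends on how much error the baseline leaves, so the same 40% corresponds to a smaller absolute gain when the baseline is already strong.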
