Entity Matching (EM) aims to recognize entity records that denote the same real-world object. Neural EM models learn vector representations of entity descriptions and match entities end-to-end. Though robust, these methods require substantial resources for training and lack interpretability. In this paper, we propose a novel EM framework that consists of Heterogeneous Information Fusion (HIF) and Key Attribute Tree (KAT) Induction, decoupling feature representation from the matching decision. Using self-supervised learning and the mask mechanism from pre-trained language modeling, HIF learns embeddings of noisy attribute values through inter-attribute attention on unlabeled data. Using a set of comparison features and a limited amount of annotated data, KAT Induction learns an efficient decision tree that can be interpreted by generating entity matching rules, a structure advocated by domain experts. Experiments on six public datasets and three industrial datasets show that our method is highly efficient and outperforms state-of-the-art EM models in most cases. Our code and datasets are available at https://github.com/THU-KEG/HIF-KAT.
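To make the decoupling concrete, here is a minimal sketch of the second stage only: a shallow decision tree induced over per-attribute comparison features and then read back as matching rules. It is not the authors' implementation. Token-level Jaccard similarity stands in for the learned HIF embeddings, the Gini-based tree learner is a generic one, and every name here (`compare`, `build_tree`, `rules`, `ATTRS`, `PAIRS`) as well as the toy records are hypothetical and invented for illustration.

```python
from collections import Counter

def jaccard(a, b):
    """Token-level Jaccard similarity between two attribute values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def compare(left, right, attributes):
    """One comparison feature per attribute (assumed stand-in for HIF embeddings)."""
    return [jaccard(left.get(k, ""), right.get(k, "")) for k in attributes]

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return the (feature index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(y)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, depth=2):
    """Depth-limited tree: internal nodes are tuples, leaves are labels."""
    split = best_split(X, y) if depth > 0 else None
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority label
    j, t = split
    lo = [(r, lab) for r, lab in zip(X, y) if r[j] <= t]
    hi = [(r, lab) for r, lab in zip(X, y) if r[j] > t]
    return (j, t,
            build_tree([r for r, _ in lo], [lab for _, lab in lo], depth - 1),
            build_tree([r for r, _ in hi], [lab for _, lab in hi], depth - 1))

def predict(tree, x):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        j, t, lo, hi = tree
        tree = hi if x[j] > t else lo
    return tree

def rules(tree, attributes, path=()):
    """Flatten the tree into (condition, label) pairs, one per leaf."""
    if not isinstance(tree, tuple):
        return [(" AND ".join(path) or "TRUE", tree)]
    j, t, lo, hi = tree
    name = attributes[j]
    return (rules(lo, attributes, path + (f"sim({name}) <= {t:.2f}",)) +
            rules(hi, attributes, path + (f"sim({name}) > {t:.2f}",)))

# Toy labeled record pairs (invented); label 1 = match, 0 = non-match.
ATTRS = ["title", "brand"]
PAIRS = [
    ({"title": "apple iphone 12 64gb", "brand": "apple"},
     {"title": "iphone 12 apple 64gb", "brand": "apple"}, 1),
    ({"title": "apple iphone 12 64gb", "brand": "apple"},
     {"title": "samsung galaxy s21", "brand": "samsung"}, 0),
    ({"title": "dell xps 13 laptop", "brand": "dell"},
     {"title": "dell xps 13", "brand": "dell"}, 1),
    ({"title": "dell xps 13 laptop", "brand": "dell"},
     {"title": "dell inspiron 15 laptop", "brand": "dell"}, 0),
]

X = [compare(left, right, ATTRS) for left, right, _ in PAIRS]
y = [label for _, _, label in PAIRS]
tree = build_tree(X, y)
for condition, label in rules(tree, ATTRS):
    print(f"IF {condition} THEN {'match' if label else 'non-match'}")
```

On this toy data the brand feature cannot separate the two Dell pairs, so the learned tree splits on title similarity alone. The point of the sketch is the shape of the pipeline: once comparison features exist, the matching decision can be a small tree whose root-to-leaf paths read directly as rules.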