Entity Matching (EM), which aims to identify whether two entity records from two relational tables refer to the same real-world entity, is one of the fundamental problems in data management. Traditional EM assumes that the two tables are homogeneous and share an aligned schema, yet in practical scenarios entity records commonly come in different formats (e.g., relational, semi-structured, or textual), and unifying their schemas is impractical. To support EM on records of different formats, Generalized Entity Matching (GEM) has been proposed and has recently gained much attention. Existing GEM methods are typically supervised and rely on a large number of high-quality labeled examples. However, the labeling process is extremely labor-intensive, which hinders the adoption of GEM. Low-resource GEM, i.e., GEM that requires only a small number of labeled examples, has therefore become an urgent need. To this end, this paper is the first to focus on low-resource GEM and proposes a novel low-resource GEM method, termed PromptEM. PromptEM addresses three challenging issues in low-resource GEM: designing GEM-specific prompt-tuning, improving pseudo-label quality, and running self-training efficiently. Extensive experimental results on eight real-world benchmarks demonstrate the superiority of PromptEM in terms of effectiveness and efficiency.