Entity Resolution (ER) is a critical data cleaning task for identifying records that refer to the same real-world entity. In the era of Big Data, traditional batch ER is often infeasible due to volume and velocity constraints, necessitating Progressive ER methods that maximize recall within a limited computational budget. However, existing progressive approaches fail to scale to high-velocity streams because they rely on deterministic sorting to prioritize candidate pairs, a process that incurs prohibitive super-linear complexity and heavy initialization costs. To address this scalability wall, we introduce SPER (Stochastic Progressive ER), a novel framework that redefines prioritization as a sampling problem rather than a ranking problem. By replacing global sorting with a continuous stochastic bipartite maximization strategy, SPER acts as a probabilistic high-pass filter that selects high-utility pairs in strictly linear time. Extensive experiments on eight real-world datasets demonstrate that SPER achieves significant speedups (3x to 6x) over state-of-the-art baselines while maintaining comparable recall and precision.
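To make the contrast between ranking and sampling concrete, the following is a minimal sketch of a single-pass stochastic filter. It is not SPER's actual algorithm, which the abstract does not specify. The function name `stochastic_filter`, the `budget` parameter, and the assumption that each candidate pair already carries a similarity score in [0, 1] are all hypothetical, introduced only for illustration. Each pair is accepted with probability equal to its score, so high-scoring pairs pass far more often than low-scoring ones, and no global sort is performed.

```python
import random
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def stochastic_filter(
    scored_pairs: Iterable[Tuple[Pair, float]],
    budget: int,
    rng: random.Random,
) -> Iterator[Pair]:
    """Yield candidate pairs in one pass, without sorting.

    Hypothetical illustration: each pair is kept with probability equal
    to its score (assumed to lie in [0, 1]), so the filter favors
    high-utility pairs while doing O(1) work per pair.
    """
    emitted = 0
    for pair, score in scored_pairs:
        if emitted >= budget:
            break  # comparison budget exhausted
        # Bernoulli trial: rng.random() is uniform on [0.0, 1.0)
        if rng.random() < score:
            emitted += 1
            yield pair


if __name__ == "__main__":
    candidates = [
        (("r1", "r2"), 0.95),  # likely match
        (("r1", "r3"), 0.10),  # unlikely match
        (("r4", "r5"), 0.80),
    ]
    for pair in stochastic_filter(candidates, budget=2, rng=random.Random(42)):
        print(pair)
```

A sort-based prioritizer would need all scores up front and O(n log n) time before emitting its first pair. The sketch above can start emitting pairs immediately as the stream arrives, at the cost of occasionally passing a low-scoring pair or skipping a high-scoring one.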