SPER: Accelerating Progressive Entity Resolution via Stochastic Bipartite Maximization

Entity Resolution (ER) is a critical data cleaning task for identifying records that refer to the same real-world entity. In the era of Big Data, traditional batch ER is often infeasible due to volume and velocity constraints, necessitating Progressive ER methods that maximize recall within a limited computational budget. However, existing progressive approaches fail to scale to high-velocity streams because they rely on deterministic sorting to prioritize candidate pairs, a process that incurs prohibitive super-linear complexity and heavy initialization costs. To address this scalability wall, we introduce SPER (Stochastic Progressive ER), a novel framework that redefines prioritization as a sampling problem rather than a ranking problem. By replacing global sorting with a continuous stochastic bipartite maximization strategy, SPER acts as a probabilistic high-pass filter that selects high-utility pairs in strictly linear time. Extensive experiments on eight real-world datasets demonstrate that SPER achieves significant speedups (3x to 6x) over state-of-the-art baselines while maintaining comparable recall and precision.
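The contrast the abstract draws between ranking and sampling can be illustrated with a small sketch. The abstract does not give SPER's actual sampling rule, so everything below is an assumption made for illustration: candidate pairs carry a similarity score in [0, 1], the hypothetical function `stochastic_filter` accepts a pair with probability `score ** sharpness`, and `sorted_top_k` stands in for a sort-based baseline. It is not the authors' algorithm and does not model the bipartite maximization step.

```python
import random


def sorted_top_k(scored_pairs, budget):
    """Sort-based prioritization: O(n log n), needs every pair up front."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    return ranked[:budget]


def stochastic_filter(scored_pairs, budget, sharpness=4.0, seed=0):
    """Single-pass sampling: O(n), usable on a stream of pairs.

    Assumption (not from the paper): a pair with score s in [0, 1] is kept
    with probability s ** sharpness, so high-scoring pairs pass far more
    often than low-scoring ones, loosely like a probabilistic high-pass filter.
    """
    rng = random.Random(seed)
    selected = []
    for pair, score in scored_pairs:
        if len(selected) >= budget:
            break  # comparison budget is spent
        if rng.random() < score ** sharpness:
            selected.append((pair, score))
    return selected


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Hypothetical candidate pairs (record ids) with synthetic scores.
    candidates = [((i, i + 1), data_rng.random()) for i in range(2000)]
    picked = stochastic_filter(candidates, budget=50)
    mean_all = sum(s for _, s in candidates) / len(candidates)
    mean_picked = sum(s for _, s in picked) / len(picked)
    print(f"mean score, all candidates: {mean_all:.2f}")
    print(f"mean score, sampled pairs:  {mean_picked:.2f}")
```

In this sketch the sampler never sorts or even sees the full candidate list, which is the property the abstract attributes to replacing ranking with sampling; the price is that the selected pairs are only likely, not guaranteed, to be the highest-scoring ones.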

Entity Resolution

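Different data providers may describe the same thing, that is, the same entity, in different ways (differing in data format, representation, and so on). Each such description is called a reference to that entity. Entity resolution is the process of resolving a set of references and mapping them to entities in the real world. Entity Resolution is also known as Record Linkage, Object Identification, Individual Identification, and Duplicate Detection.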
