The Hausdorff distance (HD) is a robust measure of set dissimilarity, but computing it exactly on large, high-dimensional datasets is prohibitively expensive. We propose \textbf{ProHD}, a projection-guided approximation algorithm that dramatically accelerates HD computation while maintaining high accuracy. ProHD identifies a small subset of candidate "extreme" points by projecting the data onto a few informative directions (such as the centroid axis and top principal components) and computing the HD on this subset. This approach guarantees an underestimate of the true HD with a bounded additive error and typically achieves results within a few percent of the exact value. In extensive experiments on image, physics, and synthetic datasets (up to two million points in $D=256$), ProHD runs 10--100$\times$ faster than exact algorithms while attaining 5--20$\times$ lower error than random sampling-based approximations. Our method enables practical HD calculations in scenarios like large vector databases and streaming data, where quick and reliable set distance estimation is needed.
翻译:豪斯多夫距离(HD)是一种鲁棒的集合差异度量方法,但在大规模、高维数据集上进行精确计算的计算代价极高。本文提出\\textbf{ProHD},一种基于投影引导的近似算法,能够在保持高精度的同时显著加速HD计算。ProHD通过将数据投影到少数信息丰富的方向(例如质心轴和前几个主成分)上,识别出一个较小的候选“极值”点子集,并在该子集上计算HD。该方法保证对真实HD的低估具有有界加性误差,并且通常能够获得与精确值相差几个百分比以内的结果。在图像、物理和合成数据集(规模高达两百万个点,维度$D=256$)上进行的大量实验中,ProHD的运行速度比精确算法快10--100倍,同时其误差比基于随机采样的近似方法低5--20倍。我们的方法使得在大规模向量数据库和流数据等需要快速可靠集合距离估计的场景中进行实用的HD计算成为可能。