Metric based comparison operations such as finding maximum, nearest and farthest neighbor are fundamental to studying various clustering techniques such as $k$-center clustering and agglomerative hierarchical clustering. These techniques crucially rely on accurate estimation of pairwise distance between records. However, computing exact features of the records, and their pairwise distances is often challenging, and sometimes not possible. We circumvent this challenge by leveraging weak supervision in the form of a comparison oracle that compares the relative distance between the queried points such as `Is point u closer to v or w closer to x?'. However, it is possible that some queries are easier to answer than others using a comparison oracle. We capture this by introducing two different noise models called adversarial and probabilistic noise. In this paper, we study various problems that include finding maximum, nearest/farthest neighbor search under these noise models. Building upon the techniques we develop for these comparison operations, we give robust algorithms for k-center clustering and agglomerative hierarchical clustering. We prove that our algorithms achieve good approximation guarantees with a high probability and analyze their query complexity. We evaluate the effectiveness and efficiency of our techniques empirically on various real-world datasets.
翻译:基于计量的比较操作,例如寻找最大、最近和最远的邻居,对于研究诸如$k$中央集聚和聚居群集等各种集群技术至关重要。这些技术的关键在于准确估计记录之间的对称距离。然而,计算记录的确切特征及其对称距离往往具有挑战性,有时甚至不可能。我们通过利用薄弱的监督手段,以比较“比较点接近于或接近于x?”等查询点之间的相对距离来规避这一挑战。然而,有些查询可能比其他查询更容易用比较或触角来回答。我们通过采用两种不同的噪音模型,即对抗性和概率性噪音来捕捉到这一点。在本文中,我们研究各种问题,包括在这些噪音模型下找到最大、最接近/最远的邻居搜索。我们根据为这些比较操作开发的技术,我们为 k-center 集聚和 glominal 级集提供了强有力的算法。我们证明我们的算法在高概率下实现了良好的近似保证,并分析了其查询复杂性。我们评估了我们数据的真实性与效率。