Many datasets are naturally modeled as graphs, where vertices denote entities and edges encode pairwise interactions. However, some problems exhibit higher-order structure that lies beyond this framework. Among the simplest examples is line clustering, in which points in a Euclidean space are grouped into clusters well approximated by line segments. As any two points trivially determine a line, the relevant structure emerges only when considering higher-order tuples. To capture this, we construct a 3-uniform hypergraph by treating sets of three points as hyperedges whenever they are approximately collinear. This geometric hypergraph encodes information about the underlying line segments, which can be extracted using community recovery algorithms. We characterize the fundamental limits of line clustering and establish the near-optimality of hypergraph-based methods. In particular, we derive information-theoretic thresholds for exact and almost exact recovery for noisy observations from intersecting lines in the plane. Finally, we introduce a polynomial-time spectral algorithm that succeeds up to polylogarithmic factors of the information-theoretic bounds.
翻译:许多数据集自然地建模为图,其中顶点表示实体,边编码成对交互。然而,某些问题展现出超越此框架的高阶结构。最简单的例子之一是直线聚类,其中欧几里得空间中的点被分组为由线段良好近似的簇。由于任意两点平凡地确定一条直线,相关结构仅在考虑高阶元组时才会显现。为捕捉此结构,我们构建了一个3-一致超图:每当三个点近似共线时,我们将其集合视为超边。该几何超图编码了关于底层线段的信息,这些信息可通过社区恢复算法提取。我们刻画了直线聚类的根本极限,并确立了基于超图的方法的近似最优性。具体而言,我们推导了平面中相交直线在噪声观测下实现精确恢复与几乎精确恢复的信息论阈值。最后,我们提出了一种多项式时间谱算法,该算法在信息论界限的多对数因子范围内取得成功。