Vector Similarity Search (VSS) in high-dimensional spaces is rapidly emerging as core functionality in next-generation database systems for numerous data-intensive services, from embedding lookups in large language models (LLMs) to semantic information retrieval and recommendation engines. Current benchmarks, however, evaluate VSS primarily on the recall-latency trade-off against a ground truth defined solely by distance metrics, neglecting how retrieval quality ultimately impacts downstream tasks. This disconnect can mislead both academic research and industrial practice. We present Iceberg, a holistic benchmark suite for end-to-end evaluation of VSS methods in realistic application contexts. Taking a task-centric view, Iceberg uncovers the Information Loss Funnel, which identifies three principal sources of end-to-end performance degradation: (1) Embedding Loss during feature extraction; (2) Metric Misuse, where distances poorly reflect task relevance; and (3) Data Distribution Sensitivity, which captures index robustness across data skews and modalities. For a more comprehensive assessment, Iceberg spans eight diverse datasets across key domains, including image classification, face recognition, text retrieval, and recommendation systems. Each dataset, ranging from 1M to 100M vectors, includes rich, task-specific labels and evaluation metrics, enabling assessment of retrieval algorithms within the full application pipeline rather than in isolation. Iceberg benchmarks 13 state-of-the-art VSS methods and re-ranks them by application-level metrics, revealing substantial deviations from traditional rankings derived purely from recall-latency evaluations. Building on these insights, we define a set of task-centric meta-features and derive an interpretable decision tree to guide practitioners in selecting and tuning VSS methods for their specific workloads.