Retrieving code units (e.g., files, classes, functions) that are semantically relevant to a given user query, bug report, or feature request from large codebases is a fundamental challenge for LLM-based coding agents. Agentic approaches typically employ sparse retrieval methods like BM25 or dense embedding strategies to identify relevant units. While embedding-based approaches can outperform BM25 by large margins, they often lack exploration of the codebase and underutilize its underlying graph structure. To address this, we propose SpIDER (Spatially Informed Dense Embedding Retrieval), an enhanced dense retrieval approach that incorporates LLM-based reasoning over auxiliary context obtained through graph-based exploration of the codebase. Empirical results show that SpIDER consistently improves dense retrieval performance across several programming languages.
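The pipeline the abstract describes (dense retrieval, then graph-based exploration to gather auxiliary context, then a reasoning step over that context) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy two-dimensional embeddings, the adjacency-list graph, and the `rerank` callback standing in for LLM-based reasoning are all assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dense_top_k(query_vec, unit_vecs, k):
    """Rank code units by embedding similarity to the query and keep the top k."""
    ranked = sorted(
        unit_vecs,
        key=lambda name: cosine(query_vec, unit_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def expand_with_neighbors(seeds, graph):
    """Collect direct graph neighbors of the seed units as auxiliary context."""
    seen = set(seeds)
    context = []
    for seed in seeds:
        for neighbor in graph.get(seed, []):
            if neighbor not in seen:
                seen.add(neighbor)
                context.append(neighbor)
    return context


def retrieve(query_vec, unit_vecs, graph, k, rerank):
    """Dense retrieval, then graph exploration, then a reasoning step.

    `rerank` is a hypothetical placeholder for the LLM-based reasoning
    stage; here it is any callable taking (seeds, neighbors).
    """
    seeds = dense_top_k(query_vec, unit_vecs, k)
    neighbors = expand_with_neighbors(seeds, graph)
    return rerank(seeds, neighbors)


# Toy data (hypothetical): embeddings per code unit and a call/containment graph.
unit_vecs = {
    "auth.login": [1.0, 0.0],
    "auth.hash_password": [0.6, 0.8],
    "ui.render": [0.0, 1.0],
}
graph = {"auth.login": ["auth.hash_password", "auth.session"]}

# Trivial stand-in for the reasoning step: keep seeds first, then neighbors.
result = retrieve([1.0, 0.1], unit_vecs, graph, k=1,
                  rerank=lambda seeds, neighbors: seeds + neighbors)
print(result)
```

In this sketch the graph step surfaces `auth.session`, a unit with no embedding in the toy index, which is the kind of structurally related candidate that similarity search alone would not return. A real system would replace the lambda with a model call that reasons over the seed units and their neighbors.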