Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders, which require full corpus access and expensive index construction. We propose generating synthetic hard negatives directly from a given query and its positive passage using Large Language Models (LLMs). We fine-tune DistilBERT on synthetic negatives generated by four state-of-the-art LLMs (Qwen3, LLaMA3, Phi4) ranging from 4B to 30B parameters and evaluate performance across 10 BEIR benchmark datasets. Contrary to the prevailing assumption that stronger generative models yield better synthetic data, we find that our generative pipeline consistently underperforms traditional corpus-based mining strategies (BM25 and cross-encoder). Furthermore, we observe that scaling the generator model does not monotonically improve retrieval performance: the 14B-parameter model outperforms the 30B model, which in some settings is the worst performer.
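To make the generation step concrete, the following is a minimal sketch of prompting an instruction-tuned LLM for a hard negative given a (query, positive passage) pair. The checkpoint name, prompt wording, and decoding settings are illustrative assumptions, not the exact experimental setup.

```python
# Minimal sketch: ask an instruction-tuned LLM to write a passage that is
# topically close to the query but does not answer it (a hard negative).
# Model name, prompt, and decoding settings are assumptions for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen3-4B")  # assumed checkpoint

def synthetic_hard_negative(query: str, positive: str) -> str:
    prompt = (
        "Write one passage that is topically similar to the query and uses "
        "overlapping vocabulary, but does NOT answer it.\n"
        f"Query: {query}\n"
        f"Relevant passage: {positive}\n"
        "Hard negative passage:"
    )
    out = generator(prompt, max_new_tokens=200, return_full_text=False)
    return out[0]["generated_text"].strip()
```

The resulting (query, positive, synthetic negative) triplets can then be fed to a standard contrastive or triplet loss when fine-tuning the DistilBERT retriever, in place of corpus-mined negatives.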