Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from a variety of sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 of the 15 datasets in the BGE collection reduces the training set size by a factor of 2.35, yet surprisingly increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We use LLMs as a simple, cost-effective approach to identifying and relabeling false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both the E5 (base) and Qwen2.5-7B retrieval models by 0.7–1.4 nDCG@10 points on BEIR and by 1.7–1.8 nDCG@10 points on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. Human annotation results support the reliability of LLMs in identifying false negatives. Our training dataset and code are publicly available.
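To make the relabeling step concrete, the sketch below shows one way an LLM can be used as a relevance judge over a training example's labeled negatives, promoting any passage the model deems relevant into the positives list. This is a minimal illustration only: the prompt wording, the yes/no protocol, the `gpt-4o-mini` model name, and the OpenAI-compatible client are assumptions, not the exact setup used in this work.

```python
# Minimal sketch of LLM-based false-negative detection for retrieval
# training data. Prompt, model, and client are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Query: {query}\n"
    "Passage: {passage}\n"
    "Does the passage answer or directly support the query? "
    "Reply with exactly one word: Yes or No."
)

def is_false_negative(query: str, passage: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the LLM judges a labeled 'negative' passage relevant."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def relabel(example: dict) -> dict:
    """Move LLM-identified false negatives into the positives list."""
    kept, promoted = [], []
    for passage in example["negatives"]:
        (promoted if is_false_negative(example["query"], passage) else kept).append(passage)
    return {
        "query": example["query"],
        "positives": example["positives"] + promoted,  # relabeled as true positives
        "negatives": kept,
    }
```

Running `relabel` over each training example yields a cleaned dataset in which LLM-flagged false negatives are treated as true positives, which is the relabeling strategy evaluated above.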