In neural Information Retrieval (IR), ongoing research is directed towards improving the first-stage retriever in ranking pipelines. Learning dense embeddings and conducting retrieval with efficient approximate nearest neighbor search has proven to work well. Meanwhile, there has been growing interest in learning \emph{sparse} representations for documents and queries, which could inherit desirable properties of bag-of-words models such as exact term matching and the efficiency of inverted indexes. Introduced recently, the SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches. In this paper, we build on SPLADE and propose several significant improvements in terms of effectiveness and/or efficiency. More specifically, we modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation. We also report results on the BEIR benchmark. Overall, SPLADE is considerably improved, with gains of more than $9$\% on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
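As a concrete sketch of the pooling modification mentioned above (using the notation of the original SPLADE formulation, which is assumed here rather than defined in this abstract): given the MLM importance logits $w_{ij}$ of input token $i$ for vocabulary term $j$, the term weight $w_j$ is obtained by replacing the original sum aggregation over the input sequence $t$ with a max,
\[
w_j = \max_{i \in t} \log\big(1 + \mathrm{ReLU}(w_{ij})\big)
\quad \text{instead of} \quad
w_j = \sum_{i \in t} \log\big(1 + \mathrm{ReLU}(w_{ij})\big),
\]
where the log-saturation and ReLU ensure sparse, non-negative term weights compatible with inverted indexes.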