重新思考多向量检索中 Token 检索的作用 (Rethinking the Role of Token Retrieval in Multi-Vector Retrieval) - 专知论文

会员服务 ·

0

向量检索 · ColBERT · 词元分析器 · 评分函数 · 基准 ·

2023 年 4 月 4 日

Rethinking the Role of Token Retrieval in Multi-Vector Retrieval

翻译：重新思考多向量检索中 Token 检索的作用

Jinhyuk Lee,Zhuyun Dai,Sai Meher Karthik Duddu,Tao Lei,Iftekhar Naim,Ming-Wei Chang,Vincent Y. Zhao

from arxiv, 23 pages

Multi-vector retrieval models such as ColBERT [Khattab and Zaharia, 2020] allow token-level interactions between queries and documents, and hence achieve state of the art on many information retrieval benchmarks. However, their non-linear scoring function cannot be scaled to millions of documents, necessitating a three-stage process for inference: retrieving initial candidates via token retrieval, accessing all token vectors, and scoring the initial candidate documents. The non-linear scoring function is applied over all token vectors of each candidate document, making the inference process complicated and slow. In this paper, we aim to simplify the multi-vector retrieval by rethinking the role of token retrieval. We present XTR, ConteXtualized Token Retriever, which introduces a simple, yet novel, objective function that encourages the model to retrieve the most important document tokens first. The improvement to token retrieval allows XTR to rank candidates only using the retrieved tokens rather than all tokens in the document, and enables a newly designed scoring stage that is two-to-three orders of magnitude cheaper than that of ColBERT. On the popular BEIR benchmark, XTR advances the state-of-the-art by 2.8 nDCG@10 without any distillation. Detailed analysis confirms our decision to revisit the token retrieval stage, as XTR demonstrates much better recall of the token retrieval stage compared to ColBERT.

翻译：多向量检索模型 ColBERT [Khattab and Zaharia, 2020] 等允许查询和文档之间进行标记级交互，因此在许多信息检索基准上实现了最先进的结果。然而，它们的非线性评分函数无法扩展到数百万个文档，需要一个三阶段的推理过程：通过标记检索检索初始候选项、访问所有标记向量，并对初始候选文档进行评分。非线性评分函数应用于每个候选文档的所有标记向量，使推理过程复杂而缓慢。在本文中，我们旨在通过重新思考 Token 检索的作用来简化多向量检索。我们提出了 XTR，即 ConteXtualized Token Retriever，它引入了一个简单而新颖的目标函数，鼓励模型先检索最重要的文档标记。Token 检索的改进使得 XTR 只使用检索到的标记对候选进行排名，而不是文档中的所有标记，并且启用了一种新设计的评分阶段，其成本比 ColBERT 的评分阶段低两到三个数量级。在流行的 BEIR 基准测试中，不需要任何蒸馏，XTR 将最先进的结果提高了 2.8 nDCG@10。详细分析确认了我们重新审视 Token 检索阶段的决定，因为 XTR 显示了比 ColBERT 更好的 Token 检索阶段的召回率。

0

相关内容

向量检索

百篇论文纵览大型语言模型最新研究进展

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日

ACL2022 | 基于强化学习的实体对齐

ACL2022 | 基于强化学习的实体对齐

专知会员服务

35+阅读 · 2022年3月15日

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

专知会员服务

13+阅读 · 2022年3月12日

【CVPR 2022】【视频检索用多模态融合Transformer】Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

【CVPR 2022】【视频检索用多模态融合Transformer】Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

专知会员服务

29+阅读 · 2022年3月6日

【神经自然语言处理进展：建模，学习，推理】Progress in Neural NLP: Modeling, Learning, and Reasoning

【神经自然语言处理进展：建模，学习，推理】Progress in Neural NLP: Modeling, Learning, and Reasoning

专知会员服务

78+阅读 · 2020年8月13日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

【北京大学】探索提取跨模态信息进行图像caption，Exploring and Distilling Cross-Modal Information for Image Captioning

【北京大学】探索提取跨模态信息进行图像caption，Exploring and Distilling Cross-Modal Information for Image Captioning

专知会员服务

54+阅读 · 2020年3月3日

【Google ICLR2020论文】嵌入式大规模检索的预训练任务，Pre-training Tasks for Embedding-based Large-scale Retrieval

【Google ICLR2020论文】嵌入式大规模检索的预训练任务，Pre-training Tasks for Embedding-based Large-scale Retrieval

专知会员服务

28+阅读 · 2020年2月12日

【WWW2020】学习上下文化文档表示用于医疗答案检索，Learning Contextualized Document Representations for Healthcare Answer Retrieval

【WWW2020】学习上下文化文档表示用于医疗答案检索，Learning Contextualized Document Representations for Healthcare Answer Retrieval

专知会员服务

26+阅读 · 2020年2月10日

【文章|自注意力(self-attention)机制图解】《Illustrated: Self-Attention》by Raimi Karim

【文章|自注意力(self-attention)机制图解】《Illustrated: Self-Attention》by Raimi Karim

专知会员服务

45+阅读 · 2019年11月18日

一文详解深度树检索技术（TDM）三部曲（与Deep Retrieval）

一文详解深度树检索技术（TDM）三部曲（与Deep Retrieval）

PaperWeekly

4+阅读 · 2022年9月21日

BERT为何无法彻底干掉BM25？？

BERT为何无法彻底干掉BM25？？

夕小瑶的卖萌屋

0+阅读 · 2022年6月28日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

ICLR2019最佳论文出炉

ICLR2019最佳论文出炉

专知

12+阅读 · 2019年5月6日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

从Seq2seq到Attention模型到Self Attention（二）

从Seq2seq到Attention模型到Self Attention（二）

量化投资与机器学习

23+阅读 · 2018年10月9日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

孤独症疾病相关突变位点高通量筛选及致病机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

基于概率图的文本检索模型及算法研究

国家自然科学基金

2+阅读 · 2014年12月31日

长链非编码RNA HULC在细胞凋亡中的作用和机制

国家自然科学基金

0+阅读 · 2013年12月31日

定性地理信息检索的模型与方法

国家自然科学基金

0+阅读 · 2012年12月31日

AMPK/自噬通路在骨髓间充质干细胞心肌保护中的作用及机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

面向属性的CPN建模及On the Fly辅助的测试生成方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

Cystatin B缺失与Prion疾病自噬作用机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

病理性近视易感基因研究

国家自然科学基金

0+阅读 · 2009年12月31日

肿瘤细胞凋亡过程中死亡受体DR5表达上调的分子机制

国家自然科学基金

0+阅读 · 2009年12月31日

问答式信息检索中信息抽取技术研究

国家自然科学基金

3+阅读 · 2008年12月31日

Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

Arxiv

0+阅读 · 2023年5月25日

Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

Arxiv

0+阅读 · 2023年5月24日

Introducing Competition to Boost the Transferability of Targeted Adversarial Examples through Clean Feature Mixup

Arxiv

0+阅读 · 2023年5月24日

NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive Decoders

Arxiv

0+阅读 · 2023年5月23日

Rethinking the Role of Token Retrieval in Multi-Vector Retrieval

Arxiv

0+阅读 · 2023年5月23日

Rethinking the Role of Pre-ranking in Large-scale E-Commerce Searching System

Arxiv

0+阅读 · 2023年5月23日

Pre-training Methods in Information Retrieval

Arxiv

16+阅读 · 2021年11月27日

A Graph-based Relevance Matching Model for Ad-hoc Retrieval

Arxiv

11+阅读 · 2021年1月28日

Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking

Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking

Arxiv

10+阅读 · 2018年3月29日

DeepSeek: Content Based Image Search & Retrieval

Arxiv

13+阅读 · 2018年1月11日

VIP会员

文章信息

相关主题

词元分析器

相关VIP内容

百篇论文纵览大型语言模型最新研究进展

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日

ACL2022 | 基于强化学习的实体对齐

ACL2022 | 基于强化学习的实体对齐

专知会员服务

35+阅读 · 2022年3月15日

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

专知会员服务

13+阅读 · 2022年3月12日

【CVPR 2022】【视频检索用多模态融合Transformer】Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

【CVPR 2022】【视频检索用多模态融合Transformer】Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

专知会员服务

29+阅读 · 2022年3月6日

【神经自然语言处理进展：建模，学习，推理】Progress in Neural NLP: Modeling, Learning, and Reasoning

【神经自然语言处理进展：建模，学习，推理】Progress in Neural NLP: Modeling, Learning, and Reasoning

专知会员服务

78+阅读 · 2020年8月13日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

【北京大学】探索提取跨模态信息进行图像caption，Exploring and Distilling Cross-Modal Information for Image Captioning

【北京大学】探索提取跨模态信息进行图像caption，Exploring and Distilling Cross-Modal Information for Image Captioning

专知会员服务

54+阅读 · 2020年3月3日

【Google ICLR2020论文】嵌入式大规模检索的预训练任务，Pre-training Tasks for Embedding-based Large-scale Retrieval

【Google ICLR2020论文】嵌入式大规模检索的预训练任务，Pre-training Tasks for Embedding-based Large-scale Retrieval

专知会员服务

28+阅读 · 2020年2月12日

【WWW2020】学习上下文化文档表示用于医疗答案检索，Learning Contextualized Document Representations for Healthcare Answer Retrieval

【WWW2020】学习上下文化文档表示用于医疗答案检索，Learning Contextualized Document Representations for Healthcare Answer Retrieval

专知会员服务

26+阅读 · 2020年2月10日

【文章|自注意力(self-attention)机制图解】《Illustrated: Self-Attention》by Raimi Karim

【文章|自注意力(self-attention)机制图解】《Illustrated: Self-Attention》by Raimi Karim

专知会员服务

45+阅读 · 2019年11月18日

热门VIP内容

开通专知VIP会员享更多权益服务

新质生成式AI赋能产业变革的实践与路径

用于多模态大模型的离散标记化：全面综述

Nature综述：金融网络中的物理学

【CMU博士论文】通信高效且差分隐私的优化方法

相关资讯

一文详解深度树检索技术（TDM）三部曲（与Deep Retrieval）

一文详解深度树检索技术（TDM）三部曲（与Deep Retrieval）

PaperWeekly

4+阅读 · 2022年9月21日

BERT为何无法彻底干掉BM25？？

BERT为何无法彻底干掉BM25？？

夕小瑶的卖萌屋

0+阅读 · 2022年6月28日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

ICLR2019最佳论文出炉

ICLR2019最佳论文出炉

专知

12+阅读 · 2019年5月6日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

从Seq2seq到Attention模型到Self Attention（二）

从Seq2seq到Attention模型到Self Attention（二）

量化投资与机器学习

23+阅读 · 2018年10月9日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

相关论文

Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

Arxiv

0+阅读 · 2023年5月25日

Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

Arxiv

0+阅读 · 2023年5月24日

Introducing Competition to Boost the Transferability of Targeted Adversarial Examples through Clean Feature Mixup

Arxiv

0+阅读 · 2023年5月24日

NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive Decoders

Arxiv

0+阅读 · 2023年5月23日

Rethinking the Role of Token Retrieval in Multi-Vector Retrieval

Arxiv

0+阅读 · 2023年5月23日

Rethinking the Role of Pre-ranking in Large-scale E-Commerce Searching System

Arxiv

0+阅读 · 2023年5月23日

Pre-training Methods in Information Retrieval

Arxiv

16+阅读 · 2021年11月27日

A Graph-based Relevance Matching Model for Ad-hoc Retrieval

Arxiv

11+阅读 · 2021年1月28日

Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking

Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking

Arxiv

10+阅读 · 2018年3月29日

DeepSeek: Content Based Image Search & Retrieval

Arxiv

13+阅读 · 2018年1月11日

相关基金

孤独症疾病相关突变位点高通量筛选及致病机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

基于概率图的文本检索模型及算法研究

国家自然科学基金

2+阅读 · 2014年12月31日

长链非编码RNA HULC在细胞凋亡中的作用和机制

国家自然科学基金

0+阅读 · 2013年12月31日

定性地理信息检索的模型与方法

国家自然科学基金

0+阅读 · 2012年12月31日

AMPK/自噬通路在骨髓间充质干细胞心肌保护中的作用及机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

面向属性的CPN建模及On the Fly辅助的测试生成方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

Cystatin B缺失与Prion疾病自噬作用机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

病理性近视易感基因研究

国家自然科学基金

0+阅读 · 2009年12月31日

肿瘤细胞凋亡过程中死亡受体DR5表达上调的分子机制

国家自然科学基金

0+阅读 · 2009年12月31日

问答式信息检索中信息抽取技术研究

国家自然科学基金

3+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员