Learning to Tokenize for Generative Retrieval

Conventional document retrieval techniques are mainly based on the index-retrieve paradigm, and pipelines built on this paradigm are difficult to optimize end to end. As an alternative, generative retrieval represents each document as an identifier (docid) and retrieves documents by generating docids, which enables end-to-end modeling of document retrieval tasks. How to define these document identifiers, however, remains an open question. Current approaches rely on fixed, rule-based docids, such as a document's title or the result of clustering BERT embeddings, which often fail to capture the complete semantic information of a document. We propose GenRet, a document tokenization learning method that addresses the challenge of defining document identifiers for generative retrieval. GenRet learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. GenRet has three components: (i) a tokenization model that produces docids for documents; (ii) a reconstruction model that learns to reconstruct a document from its docid; and (iii) a sequence-to-sequence retrieval model that generates relevant document identifiers directly for a given query. By using an auto-encoding framework, GenRet learns semantic docids in a fully end-to-end manner. We also develop a progressive training scheme to capture the autoregressive nature of docids and to stabilize training. We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets to assess the effectiveness of GenRet. GenRet establishes a new state of the art on the NQ320K dataset. In particular, compared to generative retrieval baselines, GenRet achieves significant improvements on unseen documents. GenRet also outperforms comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.
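To make the idea of tokenizing a document into a short discrete docid and reconstructing it more concrete, here is a minimal toy sketch. It is not GenRet's implementation. It assumes documents are already embedded as small vectors, uses hand-written codebooks instead of learned ones, and swaps in plain residual quantization (nearest code, then subtract) as a stand-in for the learned tokenization model. The names `tokenize`, `reconstruct`, `retrieve`, `codebooks`, and `doc_vectors` are all hypothetical. The `retrieve` function is only a lookup that stands in for the sequence-to-sequence retrieval model, which in the abstract generates docids from the query.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, vec: Vector) -> int:
    """Return the index of the code closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def tokenize(vec: Vector, codebooks: Sequence[Codebook]) -> Tuple[int, ...]:
    """Map a vector to a discrete docid, one token per codebook.

    Each step quantizes what the previous steps left unexplained, so later
    tokens depend on earlier ones (a loose analogue of an autoregressive docid).
    """
    residual = list(vec)
    docid = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        docid.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tuple(docid)


def reconstruct(docid: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Approximate the original vector by summing the selected codes."""
    out = [0.0] * len(codebooks[0][0])
    for step, k in enumerate(docid):
        out = [o + c for o, c in zip(out, codebooks[step][k])]
    return out


# Hand-written codebooks (assumption: a real system would learn these).
codebooks = [
    [(0.0, 0.0), (10.0, 10.0)],             # step 1: coarse position
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],   # step 2: fine offset
]

# Toy document embeddings (assumption: produced by some encoder).
doc_vectors: Dict[str, Vector] = {
    "doc_a": (0.1, 0.9),
    "doc_b": (10.9, 10.1),
    "doc_c": (0.9, 0.2),
}

# Index: docid -> document name.
index = {tokenize(v, codebooks): name for name, v in doc_vectors.items()}


def retrieve(query_vec: Vector) -> Optional[str]:
    """Toy stand-in for the retrieval model: derive a docid, then look it up."""
    return index.get(tokenize(query_vec, codebooks))
```

In this sketch, the reconstruction error between `reconstruct(tokenize(v, codebooks), codebooks)` and `v` plays the role that a reconstruction objective would play in an auto-encoding setup: docids that allow better reconstruction carry more of the document's content.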
