Large decoder-only language models (LMs) can be substantially improved in terms of perplexity by retrieval (e.g., RETRO), but the impact of retrieval on text generation quality and downstream task accuracy remains unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study of a scalable pre-trained retrieval-augmented LM (i.e., RETRO), comparing it with standard GPT and with retrieval-augmented GPT where retrieval is incorporated at the fine-tuning or inference stage. We first provide the recipe to reproduce RETRO at up to 9.5B parameters while retrieving over a text corpus of 330B tokens. Based on that, we have the following novel findings: i) RETRO outperforms GPT on text generation with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity when using a nontoxic retrieval database. ii) On the LM Evaluation Harness benchmark, RETRO largely outperforms GPT on knowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore, we introduce a simple variant of the model, RETRO++, which largely improves the open-domain QA results of the original RETRO (e.g., EM score +8.6 on Natural Questions) and significantly outperforms retrieval-augmented GPT across different model sizes. Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models. We release our implementation at: https://github.com/NVIDIA/Megatron-LM#retro