Retrieval-augmented generation (RAG) has proven effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning scenarios. Recent efforts incorporate search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. However, most of these approaches rely on outcome-based supervision and offer no explicit guidance for intermediate steps, which often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity and approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early-trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during both training and inference.
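As one possible reading of the bidirectional distance (a sketch under our own assumptions; the abstract does not state the exact formula, and the symbols $q$, $z_t$, $a$, and $p_{\mathrm{LM}}$ below are illustrative), the conditional Kolmogorov complexities in the information distance $E(x,y)=\max\{K(x\mid y),\,K(y\mid x)\}$ could be replaced by language-model negative log-likelihoods:
\[
  d_{\rightarrow}(z_t) \;\approx\; -\log p_{\mathrm{LM}}(a \mid q, z_t),
  \qquad
  d_{\leftarrow}(z_t) \;\approx\; -\log p_{\mathrm{LM}}(q \mid z_t),
\]
where $q$ is the question, $z_t$ the reasoning state after step $t$, and $a$ the answer. The forward term estimates how far the current reasoning still is from the answer, the backward term how well the reasoning accounts for the question, and a bidirectional score can combine the two, e.g. $\max\{d_{\rightarrow}(z_t),\, d_{\leftarrow}(z_t)\}$ in the spirit of the information distance above.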