Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context, but prefill performance often degrades as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, and uses efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity. It integrates seamlessly with existing LLM inference engines, improving their prefill performance by 1.5-3x over state-of-the-art methods while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads. Our code is released at: https://github.com/Edinburgh-AgenticAI/RAGBoost.
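To make the mechanism concrete, the sketch below illustrates the general idea under stated assumptions: retrieved items are put into a canonical order so that concurrent sessions with overlapping retrievals produce identical prefixes an inference engine's prefix cache can reuse, items already prefilled in earlier turns are de-duplicated away, and a lightweight hint restores the retriever's true relevance order for the model. The `reuse_context` helper, the in-memory `corpus` map, and sorting by document ID are all hypothetical choices for illustration, not RAGBoost's actual implementation.

```python
# Minimal sketch (assumptions, not RAGBoost's implementation) of
# accuracy-preserving context reuse: canonical ordering for prefix-cache
# sharing, de-duplication against cached turns, and a contextual hint.
from typing import Dict, List, Tuple

def reuse_context(
    retrieved: List[str],   # doc IDs in relevance order from the retriever
    cached: List[str],      # doc IDs already prefilled earlier in the session
    corpus: Dict[str, str], # doc ID -> document text (hypothetical store)
) -> Tuple[str, str]:
    """Return (new_context, hint) for the next prompt."""
    seen = set(cached)
    fresh = []
    for d in retrieved:
        # De-duplicate: skip items whose KV states are already cached.
        if d not in seen:
            seen.add(d)
            fresh.append(d)
    # Canonical ordering (here, sorted by ID) makes sessions that retrieve
    # overlapping items emit byte-identical prefixes, maximizing cache reuse.
    fresh.sort()
    new_context = "\n\n".join(corpus[d] for d in fresh)
    # Lightweight hint: the cache-friendly layout no longer reflects
    # relevance, so record the original order for the model.
    keep = set(cached) | set(fresh)
    order = [d for d in retrieved if d in keep]
    hint = "Read the documents in this order of relevance: " + ", ".join(order)
    return new_context, hint

# Example: doc2 was prefilled in an earlier turn, so only doc7 and doc9
# are added, in canonical order, while the hint preserves relevance order.
ctx, hint = reuse_context(
    retrieved=["doc7", "doc2", "doc9"],
    cached=["doc2"],
    corpus={"doc7": "text of doc7", "doc9": "text of doc9"},
)
```

Sorting by ID is only one possible canonicalization; any deterministic order shared across sessions yields identical prefixes, and the hint is what lets the layout diverge from relevance order without hurting reasoning.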