Large language models (LLMs) show remarkable language processing and reasoning ability but are prone to hallucination when asked about private data. Retrieval-augmented generation (RAG) retrieves relevant data that fits into an LLM's context window and prompts the LLM for an answer. GraphRAG extends this approach to structured knowledge graphs (KGs) and to questions about entities multiple hops away. Most recent GraphRAG methods either overlook the retrieval step or rely on ad hoc retrieval processes that are abstract or inefficient, which prevents their adoption when the KGs are stored in graph databases that support graph query languages. In this work, we present GraphRAFT, a retrieve-and-reason framework that finetunes LLMs to generate provably correct Cypher queries that retrieve high-quality subgraph contexts, and to produce accurate answers. Our method is the first such solution that can be taken off the shelf and used on KGs stored in native graph DBs. Benchmarks suggest that our method is sample-efficient and scales with the availability of training data. It achieves significantly better results than all state-of-the-art models across all four standard metrics on two challenging Q&A tasks over large text-attributed KGs.