In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy, grounding model outputs in external, domain-specific documents. Yet this approach can itself introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related queries. Our contributions are threefold: i) the creation of a benchmark dataset using consumer medicine information documents from the Australian Therapeutic Goods Administration (TGA), whose headings are repurposed as natural language questions; ii) the retrieval of PubMed abstracts using TGA headings, stratified across multiple publication years, to enable controlled temporal evaluation of outdated evidence; and iii) a comparative analysis of the frequency and impact of outdated or contradictory content on model-generated responses, assessing how LLMs integrate and reconcile temporally inconsistent information. Our findings show that contradictions between highly similar abstracts do, in fact, degrade performance, leading to inconsistencies and reduced factual accuracy in model answers. These results highlight that retrieval similarity alone is insufficient for reliable medical RAG and underscore the need for contradiction-aware filtering strategies to ensure trustworthy responses in high-stakes domains.