Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. While this experience learning capability enhances agentic autonomy, it introduces a critical, unexplored attack surface, i.e., the trust boundary between an agent's reasoning core and its own past. In this paper, we introduce MemoryGraft. It is a novel indirect injection attack that compromises agent behavior not through immediate jailbreaks, but by implanting malicious successful experiences into the agent's long-term memory. Unlike traditional prompt injections that are transient, or standard RAG poisoning that targets factual knowledge, MemoryGraft exploits the agent's semantic imitation heuristic which is the tendency to replicate patterns from retrieved successful tasks. We demonstrate that an attacker who can supply benign ingestion-level artifacts that the agent reads during execution can induce it to construct a poisoned RAG store where a small set of malicious procedure templates is persisted alongside benign experiences. When the agent later encounters semantically similar tasks, union retrieval over lexical and embedding similarity reliably surfaces these grafted memories, and the agent adopts the embedded unsafe patterns, leading to persistent behavioral drift across sessions. We validate MemoryGraft on MetaGPT's DataInterpreter agent with GPT-4o and find that a small number of poisoned records can account for a large fraction of retrieved experiences on benign workloads, turning experience-based self-improvement into a vector for stealthy and durable compromise. To facilitate reproducibility and future research, our code and evaluation data are available at https://github.com/Jacobhhy/Agent-Memory-Poisoning.
翻译:大型语言模型(LLM)智能体日益依赖长期记忆与检索增强生成(RAG)来持久化经验并优化未来性能。尽管这种经验学习能力增强了智能体的自主性,却也引入了一个关键且尚未被探索的攻击面,即智能体推理核心与其自身过往经验之间的信任边界。本文提出MemoryGraft,这是一种新颖的间接注入攻击,其危害智能体行为的方式并非通过即时越狱,而是通过将恶意的成功经验植入智能体的长期记忆中。与短暂存在的传统提示注入或针对事实性知识的标准RAG污染不同,MemoryGraft利用了智能体的语义模仿启发式——即倾向于复制从检索到的成功任务中学习到的模式。我们证明,攻击者若能提供智能体在执行过程中读取的良性摄入级工件,便可诱导其构建一个受污染的RAG存储库,其中少量恶意过程模板将与良性经验一同被持久保存。当智能体后续遇到语义相似的任务时,基于词法和嵌入相似性的联合检索会可靠地唤起这些被嫁接的记忆,智能体将采纳其中嵌入的不安全模式,从而导致跨会话的持续性行为偏移。我们在基于GPT-4o的MetaGPT DataInterpreter智能体上验证了MemoryGraft,发现少量污染记录即可在良性工作负载中占据检索经验的很大比例,从而使基于经验的自我改进转变为隐秘且持久的危害载体。为促进可复现性与未来研究,我们的代码与评估数据已公开于https://github.com/Jacobhhy/Agent-Memory-Poisoning。