The rapid deployment of Large Language Models (LLMs) has created an urgent need for stronger security and privacy measures in Machine Learning (ML). LLMs increasingly process untrusted text inputs and even generate executable code, often while having access to sensitive system controls. To address these risks, several companies have introduced guard models: smaller, specialized models designed to protect text generation models from adversarial or malicious inputs. In this work, we advance the study of adversarial inputs by introducing Super Suffixes, suffixes capable of overriding multiple alignment objectives across models with different tokenization schemes. We demonstrate the effectiveness of Super Suffixes and of our joint optimization technique by bypassing the protection mechanisms of Llama Prompt Guard 2 on five different text generation models for malicious text and code generation. To the best of our knowledge, this is the first work to show that Llama Prompt Guard 2 can be compromised through joint optimization. Additionally, by analyzing how the similarity of a model's internal state to specific concept directions changes as a token sequence is processed, we propose an effective and lightweight method for detecting Super Suffix attacks. We show that the cosine similarity between the residual stream and certain concept directions serves as a distinctive fingerprint of model intent. Our proposed countermeasure, DeltaGuard, substantially improves the detection of malicious prompts generated with Super Suffixes, raising the non-benign classification rate to nearly 100% and making it a valuable addition to the guard model stack for robustness against adversarial prompt attacks.
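The detection idea above can be illustrated with a minimal sketch: compute the cosine similarity between each token's residual-stream vector and a concept direction, then flag prompts whose similarity trajectory shifts abruptly. This is only an illustration of the general trajectory-analysis principle, not the paper's actual DeltaGuard implementation; the function names, the synthetic vectors, and the `delta_threshold` value are all hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_trace(residual_states, concept_direction):
    """Per-token cosine similarity of residual-stream vectors to a concept direction.

    residual_states: sequence of hidden-state vectors, one per token position.
    concept_direction: a fixed direction (e.g. a probe for a harmful concept).
    """
    return [cosine_similarity(h, concept_direction) for h in residual_states]

def flag_suffix_attack(trace, delta_threshold=0.5):
    """Flag a prompt whose similarity trajectory jumps sharply between
    adjacent tokens (an illustrative 'delta' criterion; threshold is made up)."""
    deltas = np.abs(np.diff(trace))
    return bool(np.any(deltas > delta_threshold))

# Synthetic example: 4-dim hidden states, concept direction along axis 0.
concept = np.array([1.0, 0.0, 0.0, 0.0])
benign = [np.array([0.0, 1.0, 0.0, 0.0]) for _ in range(5)]        # flat trace
attacked = benign[:3] + [np.array([1.0, 0.1, 0.0, 0.0])] * 2       # sharp jump
```

On the synthetic data, the benign trace stays flat near zero and is not flagged, while the attacked trace jumps toward the concept direction mid-sequence and is flagged.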