Given the high cost of training large language models (LLMs) from scratch, safeguarding LLM intellectual property (IP) has become increasingly crucial. As the standard paradigm for IP ownership verification, LLM fingerprinting plays a vital role in addressing this challenge. Existing LLM fingerprinting methods verify ownership by extracting or injecting model-specific features. However, they overlook potential attacks during the verification process, leaving them ineffective when the model thief fully controls the LLM's inference process. In such settings, attackers may share prompt-response pairs to enable fingerprint unlearning, or manipulate outputs to evade exact-match verification. We propose iSeal, the first fingerprinting method designed for reliable verification when the model thief controls the suspected LLM in an end-to-end manner. It injects unique features into both the model and an external module, reinforced by an error-correction mechanism and a similarity-based verification strategy. These components resist verification-time attacks, including collusion-based fingerprint unlearning and response manipulation, as supported by both theoretical analysis and empirical results. iSeal achieves a 100% Fingerprint Success Rate (FSR) on 12 LLMs against more than 10 attacks, whereas baselines fail under unlearning and response manipulation.
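To make the contrast with exact-match verification concrete, the sketch below illustrates why a similarity-based check can survive response manipulation that defeats exact matching. This is an illustrative toy only: the bag-of-words cosine metric and the 0.7 threshold are assumptions for exposition, not iSeal's actual verification procedure.

```python
# Illustrative sketch: exact-match vs. similarity-based fingerprint verification.
# The cosine-over-token-counts metric and the 0.7 threshold are assumptions for
# illustration; they are not the similarity measure used by iSeal.
from collections import Counter
import math


def exact_match_verify(response: str, expected: str) -> bool:
    # Fails as soon as the thief perturbs even one token of the output.
    return response == expected


def cosine_similarity(a: str, b: str) -> float:
    # Toy similarity over bag-of-words token counts.
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def similarity_verify(response: str, expected: str, threshold: float = 0.7) -> bool:
    # Tolerates small, meaning-preserving manipulations of the response.
    return cosine_similarity(response, expected) >= threshold


expected = "the quick brown fox jumps over the lazy dog"
manipulated = "the quick brown fox leaps over the lazy dog"  # thief-perturbed output
print(exact_match_verify(manipulated, expected))   # False: exact-match verification evaded
print(similarity_verify(manipulated, expected))    # True: similarity check still passes
```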