As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model outputs, a strategy known as steganography. This work investigates how to verify model responses to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. We formalize model exfiltration as a security game, propose a verification framework that provably mitigates steganographic exfiltration, and specify the trust assumptions associated with our scheme. To enable verification, we characterize the valid sources of non-determinism in large language model inference and introduce two practical estimators for them. We evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% with a false-positive rate of 0.01%, corresponding to a >200x slowdown for adversaries. Overall, this work establishes a foundation for defending against model weight exfiltration and demonstrates that strong protection can be achieved at minimal additional cost to inference providers.