Machine learning models constitute valuable intellectual property, yet they remain vulnerable to model extraction attacks (MEAs), in which adversaries replicate a model's functionality through black-box queries. Model watermarking counters MEAs by embedding forensic markers that support ownership verification. Current black-box watermarks prioritize surviving MEAs through representation entanglement, yet their resilience against sequential MEAs and removal attacks remains inadequately explored. Our study reveals that this risk is underestimated because existing removal methods are themselves weakened by entanglement. To address this gap, we propose the Watermark Removal attacK (WRK), which circumvents entanglement constraints by exploiting the decision boundaries shaped by prevailing sample-level watermark artifacts. WRK reduces watermark success rates by at least 88.79% across existing watermarking benchmarks. For robust protection, we further propose Class-Feature Watermarks (CFW), which improve resilience by relying on class-level artifacts. CFW constructs a synthetic class from out-of-domain samples, eliminating the vulnerable decision boundaries between original in-domain samples and their artifact-modified counterparts (watermark samples). CFW also jointly optimizes MEA transferability and post-MEA stability. Experiments across multiple domains show that CFW consistently outperforms prior methods in resilience, maintaining a watermark success rate of at least 70.15% in extracted models even under combined MEA and WRK distortion, while preserving the utility of the protected models.