Model fingerprint detection techniques have emerged as a promising approach for attributing AI-generated images to their source models, but their robustness under adversarial conditions remains largely unexplored. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across the RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under constrained black-box access. While forgery is more challenging than removal, its success varies significantly across target models. We also identify a utility-robustness trade-off: the methods with the highest attribution accuracy are often vulnerable to attacks. Although some techniques exhibit robustness in specific settings, none achieves both high robustness and high accuracy across all evaluated threat models. These findings highlight the need for techniques that balance robustness and accuracy, and they identify the most promising approaches for advancing this goal.