In-Context Probing for Membership Inference in Fine-Tuned Language Models

Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. Prior black-box MIA techniques rely on confidence scores or token likelihoods, but these signals are often entangled with a sample's intrinsic properties, such as content difficulty or rarity, which leads to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we develop In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. ICP supports two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, parameter-efficient fine-tuning (PEFT) configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
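
The Optimization Gap idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `LossFn` interface, the function names, the averaging over probes, the threshold, and the toy loss values are all assumptions made for illustration. The sketch only shows the general shape of the signal, which is to compare the loss on a target sample with and without probe contexts and to treat a small loss reduction as evidence of membership.

```python
from typing import Callable, Sequence

# Hypothetical black-box interface (an assumption, not from the paper):
# returns the target model's average per-token loss on `target` when
# `context` is prepended to the prompt. An empty context means no probing.
LossFn = Callable[[str, str], float]


def optimization_gap(loss_fn: LossFn, target: str, probes: Sequence[str]) -> float:
    """Estimate how much the loss on `target` drops when probe contexts are added."""
    if not probes:
        raise ValueError("at least one probe context is required")
    base_loss = loss_fn("", target)
    # Average the probed loss over all contexts (an illustrative choice).
    probed_loss = sum(loss_fn(p, target) for p in probes) / len(probes)
    return base_loss - probed_loss


def is_member(gap: float, threshold: float) -> bool:
    """Flag a sample as a likely member when little loss reduction remains."""
    return gap < threshold


def toy_loss(context: str, target: str) -> float:
    """Stand-in for a real model: fabricated numbers used only to run the sketch."""
    base = {"seen sample": 0.50, "unseen sample": 2.00}[target]
    # In this toy, context helps far more on the sample the model has not seen.
    reduction = {"seen sample": 0.05, "unseen sample": 0.40}[target]
    return base * (1 - reduction) if context else base


probes = ["similar public example A", "similar public example B"]
member_gap = optimization_gap(toy_loss, "seen sample", probes)
nonmember_gap = optimization_gap(toy_loss, "unseen sample", probes)
print(is_member(member_gap, threshold=0.1), is_member(nonmember_gap, threshold=0.1))
```

In a real audit, `toy_loss` would be replaced by queries to the deployed model, and the probe contexts would come from one of the two strategies named above (semantically similar public samples, or masked or regenerated variants of the target itself). The threshold would need to be calibrated, for example against a target false positive rate.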
