Machine unlearning aims to remove the influence of specific training data from a model without full retraining, a capability crucial for privacy, safety, and regulatory compliance. Verifying whether a model has truly forgotten target data is therefore essential for its reliability and trustworthiness. However, existing evaluation methods typically assess forgetting at the level of individual inputs, and may overlook residual influence that persists in semantically similar examples; such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), an evaluation method that detects the subtle residual influence of unlearned data and classifies whether the data has been effectively forgotten. REMIND analyzes the model's loss over small input variations, revealing patterns that single-point evaluations miss. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns. REMIND requires only query-based access, outperforms existing methods under comparable constraints, and remains robust across different models, datasets, and paraphrased inputs, making it practical for real-world deployment. By offering a more sensitive and interpretable measure of unlearning effectiveness, REMIND provides a reliable framework for assessing unlearning in language models and a new perspective on memorization and unlearning.
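To make the neighborhood-loss idea concrete, the following is a minimal Python sketch of the kind of analysis the abstract describes. It assumes only black-box, query-based loss access via a user-supplied `query_loss` function; the `perturb` helper (token dropout) and the flatness statistic (standard deviation of neighborhood losses) are illustrative assumptions, not REMIND's exact design.

```python
# Minimal sketch: query a model's loss over small input variations and
# summarize how flat vs. volatile the local loss landscape is.
# Assumptions (not from the paper): `query_loss` is any black-box function
# returning the model's loss on a text; `perturb` generates neighbors by
# random token dropout; std. dev. is an illustrative flatness measure.
import random
from statistics import mean, pstdev
from typing import Callable, Dict, List

def perturb(text: str, n_neighbors: int = 16, drop_p: float = 0.1,
            seed: int = 0) -> List[str]:
    """Generate small input variations by randomly dropping tokens."""
    rng = random.Random(seed)
    tokens = text.split()
    neighbors = []
    for _ in range(n_neighbors):
        kept = [t for t in tokens if rng.random() > drop_p]
        neighbors.append(" ".join(kept) if kept else text)
    return neighbors

def neighborhood_flatness(text: str,
                          query_loss: Callable[[str], float]) -> Dict[str, float]:
    """Query the loss on the input and its neighbors, then summarize
    the local loss landscape around the input."""
    losses = [query_loss(t) for t in [text] + perturb(text)]
    return {"mean_loss": mean(losses), "loss_std": pstdev(losses)}
```

Under the abstract's hypothesis, a lower `loss_std` (a flatter neighborhood) would be taken as evidence that the example was unlearned, while a sharper, more volatile pattern would suggest residual memorization; the decision threshold would have to be calibrated on known retained and forgotten examples.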