The increasing use of machine learning (ML) for Just-In-Time (JIT) defect prediction raises concerns about privacy leakage from software analytics data. Existing anonymization methods, such as tabular transformations and graph perturbations, often overlook contextual dependencies among software metrics, leading to suboptimal privacy-utility tradeoffs. Leveraging the contextual reasoning of Large Language Models (LLMs), we propose a cluster-guided anonymization technique that preserves contextual and statistical relationships within JIT datasets. Our method groups commits into feature-based clusters and employs an LLM to generate a context-aware parameter configuration for each cluster, which defines the alpha-beta ratios and churn mixture distributions used for anonymization. Our evaluation on six projects (Cassandra, Flink, Groovy, Ignite, OpenStack, and Qt) shows that our LLM-based approach achieves privacy level 2 (IPR ≥ 80%), improving privacy by 18% to 25% over four state-of-the-art graph-based anonymization baselines while maintaining comparable F1 scores. These results demonstrate that LLMs can act as adaptive anonymization engines when provided with cluster-specific statistics about similar data points, enabling context-sensitive, privacy-preserving software analytics without compromising predictive accuracy.
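To make the pipeline concrete, the following Python sketch illustrates one plausible realization of the cluster-then-prompt workflow the abstract describes. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (cluster_commits, propose_cluster_params, anonymize), the llm_client placeholder, and the exact way alpha-beta blending and the churn mixture noise are applied are all illustrative choices, with the LLM call stubbed out in favor of default parameters.

import json
import numpy as np
from sklearn.cluster import KMeans

def cluster_commits(features: np.ndarray, k: int = 5) -> np.ndarray:
    """Step 1: group commits into feature-based clusters."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)

def propose_cluster_params(stats: dict) -> dict:
    """Step 2: ask an LLM for context-aware anonymization parameters
    for one cluster, given that cluster's summary statistics.

    The response is expected to be JSON with an alpha-beta ratio and a
    churn mixture distribution. The LLM call is stubbed here; swap in
    your provider's client (hypothetical, not from the paper).
    """
    prompt = (
        "Given these JIT commit-cluster statistics, propose anonymization "
        f"parameters as JSON with keys 'alpha', 'beta', 'mixture':\n{json.dumps(stats)}"
    )
    # response = llm_client.complete(prompt)   # hypothetical LLM client
    # return json.loads(response)
    return {"alpha": 0.7, "beta": 0.3,         # placeholder defaults
            "mixture": [{"mean": 0.0, "std": 1.0, "weight": 1.0}]}

def anonymize(features: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Step 3: blend each commit toward its cluster centroid using the
    alpha-beta ratio, then add churn noise drawn from the proposed mixture."""
    rng = np.random.default_rng(0)
    out = features.copy()
    for c in np.unique(labels):
        idx = labels == c
        centroid = features[idx].mean(axis=0)
        stats = {"size": int(idx.sum()),
                 "mean": centroid.round(3).tolist(),
                 "std": features[idx].std(axis=0).round(3).tolist()}
        p = propose_cluster_params(stats)
        weights = [m["weight"] for m in p["mixture"]]
        comp = p["mixture"][rng.choice(len(p["mixture"]), p=weights)]
        noise = rng.normal(comp["mean"], comp["std"], size=features[idx].shape)
        out[idx] = p["alpha"] * features[idx] + p["beta"] * centroid + noise
    return out

# Usage example: 200 synthetic commits with 6 JIT metrics (e.g., churn, entropy).
X = np.abs(np.random.default_rng(1).normal(size=(200, 6)))
X_anon = anonymize(X, cluster_commits(X))

The key design point the sketch tries to capture is that the prompt carries cluster-specific statistics rather than raw records, so the LLM reasons over aggregate context while individual commits are perturbed locally.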