Synergistic Enhancement of Requirement-to-Code Traceability: A Framework Combining Large Language Model based Data Augmentation and an Advanced Encoder

Jianzhang Zhang,Jialong Zhou,Nan Niu,Jinping Hua,Chuang Liu

Automated requirement-to-code traceability link recovery, essential for industrial system quality and safety, is critically hindered by the scarcity of labeled data. To address this bottleneck, this paper proposes and validates a synergistic framework that integrates large language model (LLM)-driven data augmentation with an advanced encoder. We first demonstrate that data augmentation, optimized through a systematic evaluation of bi-directional and zero/few-shot prompting strategies, is highly effective, while the choice among leading LLMs is not a significant performance factor. Building on the augmented data, we further enhance an established, state-of-the-art method based on pre-trained language models by incorporating an encoder distinguished by a broader pre-training corpus and an extended context window. Our experiments on four public datasets quantify the distinct contributions of the framework's components: on its own, data augmentation consistently improves the baseline method, with gains of up to 26.66%; incorporating the advanced encoder provides an additional lift of 2.21% to 11.25%. Together, the two components yield a fully optimized framework with gains of up to 28.59% in $F_1$ score and 28.9% in $F_2$ score over the baseline method, outperforming ten established baselines from three dominant paradigms. This work contributes a pragmatic and scalable methodology to overcome the data scarcity bottleneck, paving the way for broader industrial adoption of data-driven requirement-to-code traceability.
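
For readers unfamiliar with the $F_1$ and $F_2$ scores reported in the abstract, the sketch below shows how they are conventionally computed for a set of predicted traceability links. It uses the standard $F_\beta$ definition, $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$, in which $\beta = 2$ weights recall more heavily than precision. The function names, the link representation as (requirement, code file) pairs, and the example identifiers are illustrative assumptions; the abstract does not describe the paper's exact evaluation protocol.

```python
from typing import Set, Tuple

Link = Tuple[str, str]  # hypothetical (requirement id, code artifact) pair


def precision_recall(predicted: Set[Link], true: Set[Link]) -> Tuple[float, float]:
    """Precision and recall of predicted links against ground-truth links."""
    tp = len(predicted & true)  # correctly recovered links
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F_beta score; returns 0.0 when the denominator is zero."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Made-up example data, for illustration only.
predicted = {("R1", "A.java"), ("R1", "B.java"), ("R2", "C.java"), ("R3", "D.java")}
true = {("R1", "A.java"), ("R2", "C.java"), ("R2", "E.java")}

p, r = precision_recall(predicted, true)  # p = 2/4, r = 2/3
f1 = f_beta(p, r, beta=1.0)               # 4/7, about 0.571
f2 = f_beta(p, r, beta=2.0)               # 5/8 = 0.625
```

Because missing a true link is often costlier than reviewing a false candidate in traceability work, $F_2$ is commonly reported alongside $F_1$; in this toy example recall exceeds precision, so $F_2$ comes out higher than $F_1$.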
