Despite rapid progress in logic locking (LL), reproducibility remains a challenge, as code is rarely made public. We present LockForge, a first-of-its-kind multi-agent large language model (LLM) framework that turns LL descriptions from papers into executable, tested code. LockForge provides a carefully crafted pipeline realizing forethought, implementation, iterative refinement, and multi-stage validation, systematically bridging the gap between prose and practice for complex LL schemes. For validation, we devise (i) an LLM-as-Judge stage with a scoring system covering behavioral checks, conceptual mechanisms, structural elements, and reproducibility on benchmarks, and (ii) an independent LLM-as-Examiner stage for ground-truth assessment. We apply LockForge to 10 seminal LL schemes, many of which lack reference implementations. Our evaluation across multiple state-of-the-art (SOTA) LLMs, including ablation studies, reveals the significant complexity of the task: both an advanced reasoning model and a sophisticated, multi-stage framework like LockForge are required. We release all implementations and benchmarks, providing a reproducible and fair foundation for evaluating future LL research.