Scenario simulation is central to testing autonomous driving systems. Scenic, a domain-specific language (DSL) for CARLA, enables precise and reproducible scenarios, but NL-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, an Example Retriever, and 14 prompting variants (ZS, FS, CoT, SP, MoT). We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B), using text metrics (BLEU, ChrF, EDIT-SIM, CrystalBLEU) and execution metrics (compilation and generation), and compare the results against an expert study (n=11). EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP, the F1 combination of EDIT-SIM and compilation rate, as a robust dataset-level proxy that improves ranking fidelity. GPT-4o performs best overall, while Qwen2.5Coder-14B reaches about 88 percent of GPT-4o's expert score on local hardware. Retrieval-augmented prompting (Few-Shot with Example Retriever, FSER) consistently boosts smaller models, and scaling shows diminishing returns beyond mid-size, with Qwen2.5Coder outperforming CodeLlama at comparable scales. NL2Scenic and EDIT-COMP offer a standardized, reproducible basis for evaluating Scenic code generation and indicate that mid-size open-source models are practical, cost-effective options for autonomous-driving scenario programming.
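A minimal sketch of how the EDIT-COMP combination described above could be computed: an F1-style harmonic mean of a dataset-level EDIT-SIM score and the compilation success rate. The function name, the per-example inputs, and the aggregate-then-combine order are illustrative assumptions, not the released implementation.

```python
# Sketch: EDIT-COMP as the harmonic mean (F1-style combination) of
# dataset-level EDIT-SIM and the compilation success rate.
# Names and aggregation order are illustrative assumptions.

def edit_comp(edit_sim_scores, compile_flags):
    """Harmonic mean of mean EDIT-SIM and compilation rate (both in [0, 1])."""
    edit_sim = sum(edit_sim_scores) / len(edit_sim_scores)   # dataset-level EDIT-SIM
    comp_rate = sum(compile_flags) / len(compile_flags)      # fraction of outputs that compile
    if edit_sim + comp_rate == 0:
        return 0.0
    return 2 * edit_sim * comp_rate / (edit_sim + comp_rate)

# Example: high textual similarity, but only 2 of 3 generations compile.
print(round(edit_comp([0.91, 0.84, 0.88], [1, 1, 0]), 2))  # ~0.76
```

The harmonic mean penalizes models that score well on one axis only, e.g. text that closely matches references but rarely compiles, which is why a dataset-level F1-style proxy can track expert rankings better than either metric alone.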