Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, yet they exhibit systematic errors on complex, multi-step programming tasks. We hypothesize that these errors stem from the flexibility of general-purpose languages, which permits multiple valid approaches and requires implicit state management. To test this hypothesis, we introduce Anka, a domain-specific language (DSL) for data transformation pipelines designed with explicit, constrained syntax that reduces ambiguity in code generation. Despite having zero prior training exposure to Anka, Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy across 100 benchmark problems. Critically, Anka demonstrates a 40 percentage point accuracy advantage over Python on multi-step pipeline tasks (100% vs. 60%), where Python's flexible syntax leads to frequent errors in operation sequencing and variable management. Cross-model validation with GPT-4o-mini confirms this advantage (+26.7 percentage points on multi-step tasks). Our results demonstrate that: (1) LLMs can learn novel DSLs entirely from in-context prompts, achieving near-native accuracy; (2) constrained syntax significantly reduces errors on complex tasks; and (3) domain-specific languages purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has extensive training. We release the complete language implementation, benchmark suite, and evaluation framework to facilitate further research.
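To make the "multi-step pipeline" setting concrete, the following is a hypothetical illustration (not taken from the paper's benchmark suite): a data-transformation task of the kind the abstract describes, written in Python. The data and step names are invented for illustration. Each step produces a named intermediate result; the abstract attributes most LLM errors in this setting to mis-ordering such steps or threading the wrong intermediate variable into a later step, which is exactly the implicit state management a constrained DSL can rule out syntactically.

```python
# Hypothetical multi-step pipeline task (illustrative only; not from the
# Anka benchmark): filter -> transform -> sort -> project over records.
records = [
    {"name": "a", "score": 82, "group": "x"},
    {"name": "b", "score": 45, "group": "y"},
    {"name": "c", "score": 91, "group": "x"},
]

# Step 1: keep records with a passing score.
passed = [r for r in records if r["score"] >= 50]
# Step 2: rescale scores to [0, 1].
scaled = [{**r, "score": r["score"] / 100} for r in passed]
# Step 3: order by score, highest first.
ranked = sorted(scaled, key=lambda r: r["score"], reverse=True)
# Step 4: project out just the names.
names = [r["name"] for r in ranked]

print(names)  # -> ['c', 'a']
```

In flexible general-purpose syntax, nothing prevents a generator from, say, sorting before rescaling or passing `passed` instead of `scaled` into step 3; a pipeline DSL with one explicit, linear sequence of operations removes those degrees of freedom.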