Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation

Text-to-SQL models, which parse natural language (NL) questions into executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing text-to-SQL models experience significant performance drops when applied to new schemas, primarily due to the lack of domain-specific data for fine-tuning. This data scarcity also limits the ability to evaluate model performance effectively in new domains, and continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in practice. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through human-LLM collaboration in a structured workflow. A within-subjects user study comparing SQLsynth with manual annotation and ChatGPT shows that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/magic-YuanTian/SQLsynth.
