Large Language Models (LLMs) have demonstrated remarkable progress in translating natural language to SQL, but a significant semantic gap persists between their general knowledge and the domain-specific semantics of individual databases. Historical translation logs constitute a rich source of this missing in-domain knowledge, because the SQL queries they contain encapsulate real-world usage patterns of the database schema. Existing methods primarily enhance the reasoning process for individual translations but fail to accumulate in-domain knowledge from past translations. We introduce ORANGE, an online self-evolutionary framework that constructs database-specific knowledge bases by parsing SQL queries from translation logs. By accumulating in-domain knowledge that captures schema and data semantics, ORANGE progressively narrows the semantic gap and improves the accuracy of subsequent SQL translations. To ensure reliability, we propose a novel nested Chain-of-Thought SQL-to-Text strategy with tuple-semantic tracking, which reduces semantic errors during knowledge generation. Experiments on multiple benchmarks confirm the practicality of ORANGE and demonstrate its effectiveness for real-world Text-to-SQL deployment, particularly in handling complex and domain-specific queries.
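As a rough illustration of what accumulating in-domain knowledge from logged SQL could look like, the sketch below counts which column pairs are joined across past queries and renders them as plain-text hints. It is not ORANGE's implementation: the abstract does not specify the parser, the knowledge-base layout, or the nested Chain-of-Thought step, so the class `JoinKnowledgeBase`, its methods, the regular expressions, and the sample table names are all hypothetical, and a real system would use a proper SQL parser rather than regexes.

```python
import re
from collections import Counter

# Assumption: a simplified pattern for "FROM t [AS] a" / "JOIN t [AS] a".
# The lookahead stops SQL keywords from being mistaken for aliases.
_STOP = r"(?!(?:WHERE|JOIN|ON|INNER|LEFT|RIGHT|GROUP|ORDER|LIMIT)\b)"
TABLE_RE = re.compile(
    r"\b(?:FROM|JOIN)\s+(\w+)(?:\s+(?:AS\s+)?" + _STOP + r"(\w+))?",
    re.IGNORECASE,
)
# Assumption: joins appear as equalities between two qualified columns.
JOIN_RE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")


class JoinKnowledgeBase:
    """Toy per-database store counting column pairs joined in logged SQL."""

    def __init__(self):
        self.join_counts = Counter()

    def add_query(self, sql):
        # Map aliases (and bare table names) to lowercased table names.
        aliases = {}
        for table, alias in TABLE_RE.findall(sql):
            aliases[table.lower()] = table.lower()
            if alias:
                aliases[alias.lower()] = table.lower()
        # Record each qualified-column equality as an undirected join edge.
        for lq, lc, rq, rc in JOIN_RE.findall(sql):
            left = f"{aliases.get(lq.lower(), lq.lower())}.{lc.lower()}"
            right = f"{aliases.get(rq.lower(), rq.lower())}.{rc.lower()}"
            self.join_counts[tuple(sorted((left, right)))] += 1

    def hints(self, min_count=1):
        """Render frequent join paths as text hints for a later prompt."""
        return [
            f"{a} joins with {b} (seen {n}x)"
            for (a, b), n in self.join_counts.most_common()
            if n >= min_count
        ]


# Hypothetical translation log for an imaginary orders/customers schema.
kb = JoinKnowledgeBase()
log = [
    "SELECT c.name FROM orders AS o JOIN customers c ON o.customer_id = c.id",
    "SELECT COUNT(*) FROM customers JOIN orders ON customers.id = orders.customer_id",
]
for sql in log:
    kb.add_query(sql)
print(kb.hints())
```

Because both logged queries resolve to the same table-qualified column pair once aliases are expanded, the store reinforces a single join path instead of recording two unrelated facts, which is the sense in which knowledge accumulates across translations in this toy version.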