The use of deep learning for database optimization has gained significant traction, offering improvements in indexing, cardinality estimation, and query optimization. However, acquiring high-quality training data remains a significant challenge. This paper explores the possibility of using generative models, such as GPT, to synthesize training data for learned database components. We present an initial feasibility study investigating their ability to produce realistic query distributions and execution plans for database workloads. Additionally, we discuss key challenges, such as data scalability and labeling, along with potential solutions. The initial results suggest that generative models can effectively augment training datasets, improving the adaptability of learned database techniques.
翻译:深度学习在数据库优化领域的应用已获得显著关注,其在索引构建、基数估计和查询优化等方面展现出改进潜力。然而,获取高质量训练数据仍是重大挑战。本文探讨了利用生成模型(如GPT)为数据库学习组件合成训练数据的可行性。我们通过初步可行性研究,考察了此类模型生成真实查询分布与数据库工作负载执行计划的能力。同时,我们讨论了数据可扩展性与标注等关键挑战及其潜在解决方案。初步结果表明,生成模型能有效扩充训练数据集,从而提升数据库学习技术的适应性。