Recent work has made significant progress in helping users automate single data preparation steps, such as string transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot). In this work, we propose to automate multiple such steps end-to-end, by synthesizing complex data pipelines that combine string transformations with table-manipulation operators. We propose a novel "by-target" paradigm that lets users easily specify the desired pipeline, a significant departure from the traditional by-example paradigm. With by-target, users provide input tables (e.g., CSV or JSON files) and point us to a "target table" (e.g., an existing database table or BI dashboard) to demonstrate what the output of the desired pipeline should schematically look like. While the problem is seemingly underspecified, our key insight is that implicit table constraints, such as functional dependencies (FDs) and keys, can be exploited to constrain the search space enough to make the problem tractable. We develop Auto-Pipeline, a system that learns to synthesize pipelines using reinforcement learning and search. Experiments on a large number of real pipelines crawled from GitHub suggest that Auto-Pipeline can successfully synthesize 60-70% of these complex pipelines, which have up to 10 steps.
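To make the role of implicit constraints concrete, the following is a minimal sketch of how keys and FDs observed on a target table could be used to reject candidate pipeline outputs. It is an illustration under stated assumptions, not the Auto-Pipeline implementation: the function names (`is_key`, `fd_holds`, `satisfies_target_constraints`), the row representation, and the example tables are all hypothetical, and we assume the constraints have already been discovered on the target table.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]
FD = Tuple[Sequence[str], Sequence[str]]  # (left-hand side, right-hand side)


def is_key(rows: List[Row], cols: Sequence[str]) -> bool:
    """Return True if the values in `cols` are unique across all rows."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in cols)
        if value in seen:
            return False
        seen.add(value)
    return True


def fd_holds(rows: List[Row], lhs: Sequence[str], rhs: Sequence[str]) -> bool:
    """Return True if the functional dependency lhs -> rhs holds in `rows`."""
    mapping = {}
    for row in rows:
        left = tuple(row[c] for c in lhs)
        right = tuple(row[c] for c in rhs)
        # The first right-hand value seen for each left-hand value must repeat.
        if mapping.setdefault(left, right) != right:
            return False
    return True


def satisfies_target_constraints(
    candidate: List[Row], keys: List[Sequence[str]], fds: List[FD]
) -> bool:
    """Check a candidate output against keys and FDs assumed to hold on the target."""
    return all(is_key(candidate, k) for k in keys) and all(
        fd_holds(candidate, lhs, rhs) for lhs, rhs in fds
    )


# Hypothetical constraints read off a target table:
# order_id is a key, and customer_id determines customer_name.
target_keys = [["order_id"]]
target_fds = [(["customer_id"], ["customer_name"])]

# Output of a candidate pipeline that joins orders to customers correctly.
good_candidate = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Ann"},
    {"order_id": 2, "customer_id": 10, "customer_name": "Ann"},
    {"order_id": 3, "customer_id": 11, "customer_name": "Bob"},
]

# Output of a candidate pipeline with a wrong join that fans out order rows.
bad_candidate = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Ann"},
    {"order_id": 1, "customer_id": 10, "customer_name": "Bob"},
    {"order_id": 2, "customer_id": 11, "customer_name": "Bob"},
]

good_ok = satisfies_target_constraints(good_candidate, target_keys, target_fds)
bad_ok = satisfies_target_constraints(bad_candidate, target_keys, target_fds)
```

In this sketch the second candidate is rejected because it both duplicates `order_id` and maps one `customer_id` to two names. A filter of this kind illustrates how constraints could prune many schematically plausible but incorrect pipelines; how Auto-Pipeline actually uses them during search is not specified in the abstract.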