We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over the input database and then train the model with a statistical software package of choice. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as a single program given in IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, and data layout optimizations, and is finally compiled into efficient low-level C++ code specialized for the given workload and data. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit-learn, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.
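To make the setting concrete, the following is a minimal, purely illustrative sketch of the kind of Jupyter-style program IFAQ targets: a feature extraction join over two toy relations followed by a learning task (linear regression via batch gradient descent), written as one plain-Python program. The relation names, features, and hyperparameters are hypothetical and not taken from the paper.

```python
# Two toy relations: sales(store, item, units) and items(item, price).
sales = [(1, "a", 3), (1, "b", 5), (2, "a", 2)]
items = {"a": 2.0, "b": 4.0}

# Feature extraction query: join sales with items on the item key.
X, y = [], []
for store, item, units in sales:
    price = items[item]
    X.append((1.0, price))   # bias feature plus item price
    y.append(float(units))   # target: units sold

# Learning task: linear regression trained by batch gradient descent.
w = [0.0, 0.0]
lr = 0.1
for _ in range(2000):
    grad = [0.0, 0.0]
    for xi, yi in zip(X, y):
        err = w[0] * xi[0] + w[1] * xi[1] - yi
        grad[0] += err * xi[0]
        grad[1] += err * xi[1]
    n = len(X)
    w = [w[j] - lr * grad[j] / n for j in range(2)]
```

In the mainstream pipeline, the join materializes the training matrix before learning begins; IFAQ instead sees both phases as one program and can optimize across the boundary, e.g. pushing aggregates past the join, before emitting specialized C++.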