Rumble: Data Independence for Large Messy Data Sets

This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous, and nested collections of JSON objects, leveraging the parallel capabilities of Spark to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings, showing that JSONiq can run efficiently on Spark to query billions of objects, scaling at least into the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, which is common in practice, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, and occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does for highly structured tables.
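To make the first insight more concrete, the following is a minimal sketch of how a small FLWOR-style query could be lowered to a parse/filter/flatMap pipeline. It is not Rumble's implementation: plain Python lists stand in for RDDs, and the names `RAW_LINES`, `flat_map`, `tags_of`, `has_high_score`, and `run_query` are hypothetical helpers introduced only for illustration. The query being mimicked is roughly `for $o in $objects where $o.score gt 3 for $t in $o.tags[] return $t`, where `$objects` is an assumed binding to the input collection.

```python
import json
from itertools import chain

# Heterogeneous, nested input (illustrative data, not from the paper):
# "tags" may be an array, a string, or absent; "score" may be absent.
RAW_LINES = [
    '{"name": "a", "tags": ["x", "y"], "score": 5}',
    '{"name": "b", "tags": "z", "score": 2}',
    '{"name": "c", "score": 7}',
    '{"name": "d", "tags": ["x"]}',
]


def flat_map(fn, items):
    """Stand-in for an RDD flatMap: apply fn to each item, flatten one level."""
    return list(chain.from_iterable(fn(item) for item in items))


def tags_of(obj):
    """Yield the members of obj["tags"] if it is an array, otherwise nothing."""
    tags = obj.get("tags")
    return tags if isinstance(tags, list) else []


def has_high_score(obj, threshold=3):
    """Keep objects with a numeric score above the threshold.

    Skipping non-numeric scores is an assumption of this sketch.
    """
    score = obj.get("score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    return is_number and score > threshold


def run_query(lines):
    objects = [json.loads(line) for line in lines]    # map: parse each line
    kept = [o for o in objects if has_high_score(o)]  # filter: where clause
    return flat_map(tags_of, kept)                    # flatMap: inner for + return


print(run_query(RAW_LINES))  # prints ['x', 'y']
```

Objects with a missing or non-array field simply contribute nothing to the result, so no schema has to be declared up front. That is the kind of input the abstract describes as difficult for Spark SQL.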
