Dataset Discovery in Data Lakes

Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
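The abstract does not say which value features or which hashing scheme the approach uses, so the following is only a minimal sketch of the general idea: hash each column's values into a fixed-size signature, treat signature disagreement as a distance, and return the lake tables closest to a target column. MinHash over distinct values is an assumed stand-in here, and the names `minhash_signature`, `build_index`, and `discover` are hypothetical, not taken from the paper.

```python
import hashlib


def minhash_signature(values, num_hashes=64):
    """Return a MinHash signature for the distinct values of one column.

    Assumption: MinHash stands in for whatever features the paper hashes.
    """
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("cannot build a signature for an empty column")
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed simulates an independent hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(v.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for v in distinct
        ))
    return signature


def estimated_distance(sig_a, sig_b):
    """Estimate the Jaccard distance as the fraction of differing slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


def build_index(lake, num_hashes=64):
    """Precompute one signature per (table, column) pair in the lake."""
    return {
        (table, column): minhash_signature(values, num_hashes)
        for table, columns in lake.items()
        for column, values in columns.items()
    }


def discover(target_values, index, k=3, num_hashes=64):
    """Return up to k (table, distance) pairs, closest tables first.

    A table's distance is that of its closest column to the target column.
    """
    target_sig = minhash_signature(target_values, num_hashes)
    best = {}
    for (table, _column), sig in index.items():
        dist = estimated_distance(target_sig, sig)
        if table not in best or dist < best[table]:
            best[table] = dist
    ranked = sorted(best.items(), key=lambda item: (item[1], item[0]))
    return ranked[:k]


# Toy lake: table name -> column name -> values (illustrative data only).
lake = {
    "cities": {"name": ["Leeds", "York", "Bath"], "country": ["UK", "UK", "UK"]},
    "products": {"sku": ["a1", "b2", "c3"]},
}
index = build_index(lake)
print(discover(["York", "Leeds", "Bath", "Hull"], index, k=2))
```

In this sketch the index is built once and reused for every query, which is where the indexing and discovery times mentioned in the abstract would be spent. A real system would presumably combine several features per column and use a locality-sensitive lookup rather than scanning every signature.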

