Recent query explanation systems help users understand anomalies in aggregation results by proposing predicates that describe input records that, if deleted, would resolve the anomalies. However, it can be difficult for users to understand how a predicate was chosen, and these approaches are limited to errors that can be resolved through deletion. In contrast, data errors may be group-wise, such as missing records or systematic value errors. This paper presents Reptile, an explanation system for hierarchical data. Given an anomalous aggregate query result, Reptile recommends the next drill-down attribute, and ranks the drill-down groups by the extent to which repairing a group's statistics to their expected values resolves the anomaly. Reptile efficiently trains a multi-level model that leverages the data's hierarchy to estimate the expected values, and uses a factorised representation of the feature matrix to remove redundancies due to the data's hierarchical structure. We further extend model training to support factorised data, and develop a suite of optimizations that leverage the data's hierarchical structure. Reptile reduces end-to-end runtimes by more than 6 times compared to a MATLAB-based implementation, correctly identifies 21/30 data errors in Johns Hopkins' COVID-19 data, and correctly resolves 20/22 complaints in a user study using data and researchers from Columbia University's Financial Instruments Sector Team.
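The ranking idea in the abstract (score each drill-down group by how much repairing its statistic to an expected value resolves the anomaly) can be illustrated with a minimal sketch. This is not Reptile's implementation: the `Group` class, the `rank_groups` function, the use of a SUM aggregate, and the `target_total` parameter standing in for the user's complaint are all assumptions made for illustration, and the expected values are hard-coded here, whereas the paper estimates them with a multi-level model.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    observed: float  # observed statistic (here a SUM) for this drill-down group
    expected: float  # hypothetical expected value; hard-coded, not model-estimated


def rank_groups(groups, target_total):
    """Rank groups by how far repairing each one alone moves the
    overall SUM towards target_total (larger gain ranks first)."""
    total = sum(g.observed for g in groups)
    baseline_gap = abs(total - target_total)
    scored = []
    for g in groups:
        # Replace only this group's observed value with its expected value.
        repaired_total = total - g.observed + g.expected
        gain = baseline_gap - abs(repaired_total - target_total)
        scored.append((g.name, gain))
    # sorted() is stable, so ties keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative numbers only: the user complains that the total (170) is too
# low and should be about 230.
groups = [
    Group("NY", observed=100.0, expected=100.0),
    Group("NJ", observed=20.0, expected=80.0),
    Group("CT", observed=50.0, expected=55.0),
]
ranking = rank_groups(groups, target_total=230.0)
```

In this toy input, repairing the hypothetical `NJ` group alone closes the whole gap, so it ranks first. A group whose observed value already matches its expected value gets a gain of zero.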