Pattern sampling has emerged as a promising approach for information discovery in large databases, allowing analysts to focus on a manageable subset of patterns. In this approach, patterns are drawn at random with a probability proportional to an interestingness measure, such as frequency or hyper-volume. This paper presents the first sampling approach designed to handle interval patterns in numerical databases. This approach, named Fips, samples interval patterns proportionally to their frequency. It uses a multi-step sampling procedure and addresses a key challenge in numerical data: accurately determining the number of interval patterns that cover each object. We extend this work with HFips, which samples interval patterns proportionally to the product of their frequency and hyper-volume. Both methods efficiently tackle the well-known long-tail phenomenon in pattern sampling. We formally prove that Fips and HFips sample interval patterns in proportion to their frequency and to the product of their hyper-volume and frequency, respectively. Through experiments on several databases, we demonstrate the quality of the obtained patterns and their robustness against the long-tail phenomenon.
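To make the idea of frequency-proportional sampling concrete, the following is a minimal Python sketch of a generic two-step procedure: first draw an object with probability proportional to the number of interval patterns that cover it, then draw one of those covering patterns uniformly. It is an illustration under stated assumptions, not the Fips algorithm itself, whose details are not given in the abstract. In particular, it assumes that interval bounds on each attribute are restricted to the distinct values observed for that attribute; the function names `covering_counts`, `sample_interval_pattern`, and `frequency` are hypothetical.

```python
import random
from collections import Counter
from math import prod


def covering_counts(data):
    """Count, for each object, the interval patterns that cover it.

    Assumption (not from the paper): the bounds of an interval on
    attribute j are taken from the distinct values seen in column j.
    """
    n_attrs = len(data[0])
    domains = [sorted({row[j] for row in data}) for j in range(n_attrs)]
    ranks = [{v: i for i, v in enumerate(dom)} for dom in domains]
    counts = []
    for row in data:
        # On attribute j, a covering interval [a, b] needs a <= value <= b:
        # (rank + 1) choices for a and (k - rank) choices for b.
        counts.append(prod(
            (ranks[j][row[j]] + 1) * (len(domains[j]) - ranks[j][row[j]])
            for j in range(n_attrs)
        ))
    return domains, ranks, counts


def sample_interval_pattern(data, rng):
    """Draw one interval pattern with probability proportional to its frequency."""
    domains, ranks, counts = covering_counts(data)
    # Step 1: pick an object, weighted by how many patterns cover it.
    row = rng.choices(data, weights=counts, k=1)[0]
    # Step 2: pick uniformly among the patterns covering that object,
    # choosing the lower and upper bound of each attribute independently.
    pattern = []
    for j, value in enumerate(row):
        r = ranks[j][value]
        low = domains[j][rng.randrange(0, r + 1)]
        high = domains[j][rng.randrange(r, len(domains[j]))]
        pattern.append((low, high))
    return tuple(pattern)


def frequency(pattern, data):
    """Number of objects whose values fall inside every interval of the pattern."""
    return sum(
        all(low <= row[j] <= high for j, (low, high) in enumerate(pattern))
        for row in data
    )


# Toy numerical database: three objects, two attributes (illustrative values).
data = [(1.0, 10.0), (2.0, 10.0), (3.0, 20.0)]
rng = random.Random(0)
draws = Counter(sample_interval_pattern(data, rng) for _ in range(5000))
for pattern, n in draws.most_common(3):
    print(pattern, "frequency =", frequency(pattern, data), "draws =", n)
```

A pattern P is reached once through each object it covers, and each such path has probability (w(o) / Z) * (1 / w(o)) = 1 / Z, where w(o) is the covering count of object o and Z is the sum of all covering counts. The probability of drawing P is therefore frequency(P) / Z, which is why the covering count must be computed exactly. Weighting by hyper-volume as well, as HFips does, would require a different weighting scheme that this sketch does not attempt.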