Efficiently Sampling Interval Patterns from Numerical Databases

Pattern sampling has emerged as a promising approach for information discovery in large databases, allowing analysts to focus on a manageable subset of patterns. In this approach, patterns are drawn at random with probability proportional to an interestingness measure, such as frequency or hyper-volume. This paper presents the first sampling approach designed to handle interval patterns in numerical databases. This approach, named Fips, samples interval patterns proportionally to their frequency. It uses a multi-step sampling procedure and addresses a key challenge in numerical data: accurately determining the number of interval patterns that cover each object. We extend this work with HFips, which samples interval patterns proportionally to the product of their frequency and hyper-volume. Both methods efficiently tackle the well-known long-tail phenomenon in pattern sampling. We formally prove that Fips and HFips sample interval patterns in proportion to their frequency and to the product of their hyper-volume and frequency, respectively. Through experiments on several databases, we demonstrate the quality of the obtained patterns and their robustness against the long-tail phenomenon.
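To make the frequency-proportional idea concrete, the following is a minimal sketch of a generic two-step sampler, not the Fips procedure itself, whose details are not given here. It assumes that interval bounds are restricted to the distinct values observed for each attribute, and the names `covering_counts` and `sample_pattern` are hypothetical. Under that assumption, an object whose value has 0-based rank r among k distinct values is covered by (r + 1)(k - r) intervals on that attribute, and the number of interval patterns covering the object is the product over attributes. Drawing an object with probability proportional to this count, then drawing uniformly among the patterns that cover it, yields each pattern with probability proportional to its frequency.

```python
import random
from bisect import bisect_left


def covering_counts(rows):
    """Return per-attribute domains and, for each object, the number of
    interval patterns covering it (assumption: bounds are observed values)."""
    n_attrs = len(rows[0])
    # Sorted distinct values per attribute: the candidate interval bounds.
    domains = [sorted({row[j] for row in rows}) for j in range(n_attrs)]
    counts = []
    for row in rows:
        total = 1
        for j, value in enumerate(row):
            rank = bisect_left(domains[j], value)    # 0-based rank of value
            lower_choices = rank + 1                 # bounds <= value
            upper_choices = len(domains[j]) - rank   # bounds >= value
            total *= lower_choices * upper_choices
        counts.append(total)
    return domains, counts


def sample_pattern(rows, rng=random):
    """Draw one interval pattern with probability proportional to its frequency."""
    domains, counts = covering_counts(rows)
    # Step 1: draw an object proportionally to the number of patterns covering it.
    row = rng.choices(rows, weights=counts, k=1)[0]
    # Step 2: draw uniformly among the patterns covering that object.
    pattern = []
    for j, value in enumerate(row):
        rank = bisect_left(domains[j], value)
        low = rng.choice(domains[j][: rank + 1])
        high = rng.choice(domains[j][rank:])
        pattern.append((low, high))
    return pattern


rows = [(1, 10), (2, 20), (3, 30)]
print(covering_counts(rows)[1])
print(sample_pattern(rows))
```

A pattern that covers f objects can be reached through each of those f objects, and each route has the same probability, which is why the two steps together give frequency-proportional sampling. Patterns that cover no object are never drawn.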
