数据流中的频率估计:学习最佳散列计划 (Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme)

We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random hashing to maintain the frequency distribution of the data steam using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm, which, as we empirically show, produces high quality solutions, and, in a special case, we are able to solve the proposed formulation exactly in linear time using dynamic programming. We empirically evaluate the proposed approach both on synthetic datasets and on real-world search query data. We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.

翻译：我们提出了一种基于优化和机器学习的数据流频率估算问题的新办法。我们提出了一种基于优化和机器学习的对数据流频率估算的新办法。与目前最先进的流流频率估算算法相反,这种算法严重依赖随机散列来维持数据蒸汽使用有限储存的频率分布,而拟议方法则利用观测到的流前缀来利用近于最佳的散列元素元素,压缩目标频率分布。我们开发了精确的混合因数线性优化配方,使我们能够对所观测的溪流前缀所见元素进行最佳或近于最佳的散列计划进行计算;然后,我们利用机器学习来吸收未知元素。此外,我们开发了高效的区块协调下行算法,正如我们的经验显示的那样,这种算法产生了高质量的解决方案,在一个特殊的情况下,我们能够利用动态的编程在线性时间内完全解决拟议的配方。我们从经验上评价了关于合成数据集和现实世界搜索查询数据的拟议方法。我们表明,拟议的方法在平均(每个要素)估计误差和预期误差幅度为45-90%的范围内,使现有方法以一至两个数量级的一至两个级的级的幅度的幅度超出现有方法。