网上网页估计变化率的在线数值 (Online Algorithms for Estimating Change Rates of Web Pages)

from arxiv, This is the author version of the paper accepted to {\it International Journal of Performance Evaluation}, Elsevier; 25 pages. arXiv admin note: text overlap with arXiv:2004.02167

A search engine maintains local copies of different web pages to provide quick search results. This local cache is kept up-to-date by a web crawler that frequently visits these different pages to track changes in them. Ideally, the local copy should be updated as soon as a page changes on the web. However, finite bandwidth availability and server restrictions limit how frequently different pages can be crawled. This brings forth the following optimization problem: maximize the freshness of the local cache subject to the crawling frequencies being within prescribed bounds. While tractable algorithms do exist to solve this problem, these either assume the knowledge of exact page change rates or use inefficient methods such as MLE for estimating the same. We address this issue here. We provide three novel schemes for online estimation of page change rates, all of which have extremely low running times per iteration. The first is based on the law of large numbers and the second on stochastic approximation. The third is an extension of the second and includes a heavy-ball momentum term. All these schemes only need partial information about the page change process, i.e., they only need to know if the page has changed or not since the last crawled instance. Our main theoretical results concern asymptotic convergence and convergence rates of these three schemes. In fact, our work is the first to show convergence of the original stochastic heavy-ball method when neither the gradient nor the noise variance is uniformly bounded. We also provide some numerical experiments (based on real and synthetic data) to demonstrate the superiority of our proposed estimators over existing ones such as MLE. We emphasize that our algorithms are also readily applicable to the synchronization of databases and network inventory management.

翻译：搜索引擎维护不同网页的本地副本, 以提供快速搜索结果。这个本地缓存由经常访问这些不同页面的网络爬行器不断更新。理想的情况是, 本地副本在网络页面变化后立即更新。但是, 有限带宽可用性和服务器限制限制会限制不同页面的浏览频率。由此产生了以下优化问题 : 最大限度地增加本地缓存的新鲜性, 以适应在指定范围内的爬移频率。虽然存在可移植的算法来解决这个问题, 但所有这些算法要么假定准确页面变化率的知识, 要么使用低效率的方法, 如 MLE 来估算相同页面的变化。我们在这里讨论这个问题。我们提供三种新的计划, 用于对页面变化率进行在线估算, 且每个版本都有极低的运行时间。第一项计划以大数字法为基础, 第二个是随机缩略图。第三个是第二个扩展, 包含一个重球动的库存术语。所有这些计划只需要关于页面变化过程的部分信息, 也就是说, 他们只需要知道网页是否已经更改了, 或者不是真正趋同级的合并, 。我们的主要方法显示我们最初的缩缩缩缩化方法是如何显示这些结果。