Scientific collaborations increasingly rely on large volumes of data, and many of them employ tiered systems to replicate that data to their worldwide user communities. Each user typically selects a different subset of the data for analysis; however, members of a research group often work on related topics that require similar data objects, so there is significant potential for data sharing. In this work, we study the access traces of a federated storage cache known as the Southern California Petabyte Scale Cache. By studying the access patterns of this caching system and its potential for reducing network traffic, we aim to explore the predictability of cache usage and the potential for more general in-network data caching. Our study shows that this distributed storage cache reduced network traffic volume by a factor of 2.35 during part of the study period. We further show that machine learning models can predict cache utilization with an accuracy of 0.88. This demonstrates that such cache usage is predictable, which could be useful for managing complex networking resources such as in-network caches.
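To make the "factor of 2.35" figure concrete, the following is a minimal sketch of how a traffic reduction factor might be computed from cache access traces. It assumes the factor is defined as the total bytes requested by users divided by the bytes that had to be fetched over the wide-area network (the cache misses); this definition is an assumption for illustration, not one stated in the abstract. The function name and the byte counts are hypothetical.

```python
def traffic_reduction_factor(hit_bytes: float, miss_bytes: float) -> float:
    """Return requested bytes divided by bytes fetched over the network.

    Assumed definition (not taken from the abstract):
        factor = (hit_bytes + miss_bytes) / miss_bytes
    where hit_bytes were served from the cache and miss_bytes had to be
    transferred from the remote source.
    """
    if miss_bytes <= 0:
        raise ValueError("miss_bytes must be positive")
    return (hit_bytes + miss_bytes) / miss_bytes


# Hypothetical byte counts, chosen only to illustrate the arithmetic:
# 135 units served from cache, 100 units fetched remotely.
example_factor = traffic_reduction_factor(hit_bytes=135.0, miss_bytes=100.0)
print(f"traffic reduction factor: {example_factor:.2f}")
```

Under this assumed definition, a factor of 2.35 would mean that for every unit of data transferred over the network, users received 2.35 units in total, with the remainder served from the cache.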