Workload Failure Prediction for Data Centers

Failed workloads that consume significant computational resources in time and space substantially reduce the efficiency of data centers and thus limit the amount of scientific work that can be achieved. While computational power has increased significantly over the years, the detection and prediction of workload failures have lagged far behind and will become increasingly critical as system scale and complexity grow further. In this study, we analyze workload traces collected from a production cluster and train machine learning models on a large number of data sets to predict workload failures. Our prediction models consist of a queue-time model that estimates the probability of workload failure before execution and a runtime model that predicts failures at runtime. Evaluation results show that the queue-time model and the runtime model can predict workload failures with maximum precision scores of 90.61% and 97.75%, respectively. Integrating the runtime model with the job scheduler helps reduce CPU time and memory usage by up to 16.7% and 14.53%, respectively.
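To make the reported metric and the two-model arrangement concrete, here is a minimal Python sketch. It assumes the standard definition of precision (true positives divided by all predicted positives) with "failed" as the positive class. The function names, the 0.5 thresholds, and the returned action labels are hypothetical and chosen only for illustration; the abstract does not specify how the scheduler acts on the predictions.

```python
from typing import Sequence


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Precision for the positive class, where 1 means 'workload failed'."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    predicted_pos = true_pos + false_pos
    # With no positive predictions precision is undefined; return 0.0 here.
    return true_pos / predicted_pos if predicted_pos else 0.0


def two_stage_decision(
    p_fail_queue: float,
    p_fail_runtime: float,
    queue_threshold: float = 0.5,    # hypothetical threshold
    runtime_threshold: float = 0.5,  # hypothetical threshold
) -> str:
    """Combine a queue-time and a runtime failure probability.

    The action labels are illustrative assumptions, not the paper's policy.
    """
    if p_fail_queue >= queue_threshold:
        return "flag_at_queue_time"
    if p_fail_runtime >= runtime_threshold:
        return "flag_at_runtime"
    return "continue"


if __name__ == "__main__":
    labels = [1, 0, 1, 1, 0]       # 1 = the workload actually failed
    predictions = [1, 1, 1, 0, 0]  # 1 = the model predicted a failure
    print(f"precision = {precision(labels, predictions):.4f}")
    print(two_stage_decision(0.1, 0.9))
```

A high precision means that few workloads flagged as failing would in fact have succeeded, which matters if a flagged workload is stopped early to save CPU time and memory.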

