In this paper, we introduce MARS, a new scheduling system for HPC-cloud infrastructures based on a cost-aware, flexible reinforcement learning approach. MARS serves as an intermediate layer for next-generation HPC-cloud resource managers. It ensembles pre-trained models from heuristic workloads and decides on the most cost-effective strategy for optimization. A workflow application is split into several optimizable, dependent sub-tasks; based on a pre-defined resource management plan, a reward is generated after each scheduled task is executed. MARS then updates its Deep Neural Network (DNN) model based on that reward, so that existing models are optimized through reinforcement mechanisms. At run-time, MARS adapts to the dynamics of workflow applications and selects the most cost-effective scheduling solution among pre-built scheduling strategies (backfilling, SJF, etc.) and the self-learning DNN model. We evaluate MARS with different real-world workflow traces. MARS can achieve 5%-60% higher performance than state-of-the-art approaches.
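The select, execute, reward, update loop described above can be illustrated with a minimal sketch. This is not the MARS implementation: it swaps the DNN for a simple epsilon-greedy running-average estimator, and the names `StrategySelector`, `simulate`, the strategy label `learned_dnn`, and the cost values are all hypothetical, chosen only for illustration. The reward is assumed to be the negative of an observed cost.

```python
import random

# Hypothetical strategy labels: backfilling and SJF are named in the text;
# "learned_dnn" is a stand-in for the self-learning model.
STRATEGIES = ["backfilling", "sjf", "learned_dnn"]


class StrategySelector:
    """Epsilon-greedy choice among scheduling strategies (illustrative only)."""

    def __init__(self, strategies, epsilon=0.1, seed=0):
        self.strategies = list(strategies)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {s: 0 for s in self.strategies}
        self.mean_reward = {s: 0.0 for s in self.strategies}

    def select(self):
        # Explore with probability epsilon, otherwise pick the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.strategies)
        return max(self.strategies, key=lambda s: self.mean_reward[s])

    def update(self, strategy, reward):
        # Incremental running mean of the rewards observed for this strategy.
        self.counts[strategy] += 1
        n = self.counts[strategy]
        self.mean_reward[strategy] += (reward - self.mean_reward[strategy]) / n


def simulate(selector, costs, steps=200):
    """Run the loop with fixed per-strategy costs (assumed; reward = -cost)."""
    for _ in range(steps):
        chosen = selector.select()
        selector.update(chosen, -costs[chosen])
    return max(selector.strategies, key=lambda s: selector.mean_reward[s])


if __name__ == "__main__":
    # Made-up costs, used only to exercise the loop.
    demo_costs = {"backfilling": 3.0, "sjf": 2.0, "learned_dnn": 1.0}
    print(simulate(StrategySelector(STRATEGIES), demo_costs))
```

In a real scheduler the fixed cost table would be replaced by measurements taken after each scheduled sub-task, and the running average by whatever learned model the system maintains.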