Data centres (DCs) underpin many prominent future technological trends, such as distributed training of large-scale machine learning models and internet-of-things-based platforms. DCs will soon account for over 3\% of global energy demand, so efficient use of DC resources is essential. Robust DC networks (DCNs) are needed to form the large-scale systems that handle this demand, but they can bottleneck how efficiently DC server resources are used, because servers with insufficient connectivity between them cannot be jointly allocated to a job. Allocating servers' resources whilst accounting for their inter-connectivity maps to an NP-hard combinatorial optimisation problem, and so inter-connectivity is often ignored in DC resource management schemes. We present Nara, a framework based on reinforcement learning (RL) and graph neural networks (GNNs) that learns network-aware allocation policies which increase the number of requests allocated over time compared to previous methods. Unique to our solution is the use of a GNN to generate representations of server nodes in the DCN, which an RL policy network then interprets as actions, choosing the servers from which resources will be allocated to incoming requests. Nara is agnostic to topology size and shape and is trained end-to-end. The method can accept up to 33\% more requests than the best baseline when deployed on DCNs with up to the order of $10\times$ more compute nodes than the DCN seen during training, and it maintains its policy's performance on DCNs with the order of $100\times$ more servers than seen during training. It also generalises, without re-training, to unseen DCN topologies with varied network structure and to unseen request distributions.
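The idea of a GNN producing per-server representations that a policy then treats as candidate actions can be illustrated with a toy sketch. This is not Nara's architecture: the mean-neighbour aggregation, the linear scoring weights, the feature layout `[free_cpu, free_bandwidth]` and all function names below are assumptions chosen only to make the data flow concrete, and nothing here is trained.

```python
import math

def message_pass(features, adjacency):
    """One round of mean-neighbour aggregation (a stand-in for a GNN layer)."""
    out = {}
    for node, feats in features.items():
        neigh = adjacency.get(node, [])
        if neigh:
            mean = [sum(features[n][i] for n in neigh) / len(neigh)
                    for i in range(len(feats))]
        else:
            mean = [0.0] * len(feats)
        # Embedding = own features concatenated with the neighbour mean.
        out[node] = feats + mean
    return out

def masked_softmax(scores, mask):
    """Softmax over the entries where mask is True; masked entries get 0."""
    valid = [s for s, m in zip(scores, mask) if m]
    if not valid:
        return [0.0] * len(scores)
    top = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def choose_server(features, adjacency, weights, servers, demand):
    """Score each server embedding and pick the most probable feasible one."""
    emb = message_pass(features, adjacency)
    scores = [sum(w * x for w, x in zip(weights, emb[s])) for s in servers]
    # Assumption: feature 0 is free CPU; servers below the demand are masked.
    mask = [features[s][0] >= demand for s in servers]
    probs = masked_softmax(scores, mask)
    if not any(mask):
        return None, probs  # request cannot be placed
    best = max(range(len(servers)), key=lambda i: probs[i])
    return servers[best], probs

# Hypothetical topology: three servers attached to one switch.
features = {"s0": [4.0, 1.0], "s1": [8.0, 1.0], "s2": [1.0, 1.0], "sw": [0.0, 10.0]}
adjacency = {"s0": ["sw"], "s1": ["sw"], "s2": ["sw"], "sw": ["s0", "s1", "s2"]}
weights = [1.0, 0.0, 0.0, 0.1]  # hypothetical, untrained scoring weights
servers = ["s0", "s1", "s2"]

chosen, probs = choose_server(features, adjacency, weights, servers, demand=2.0)
print(chosen, probs)
```

Because the scoring function is applied per node and the aggregation depends only on each node's neighbours, the same weights can be applied to a graph of any size, which is the property that makes topology-agnostic policies of this kind possible. In an RL setting the weights would be learned and the action sampled from `probs` rather than taken greedily.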