Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems with off-the-shelf solvers in an online setting is often intractable at "hyper-scale" system sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. In this work, we explore an alternative approach that reuses the original optimization problem formulation. By splitting the original problem into smaller, more tractable problems for subsets of the system and then coalescing the resulting sub-allocations into a global solution, we achieve empirically quasi-optimal performance (within 1.5% of optimal) in multiple domains, with runtime improvements of several orders of magnitude. Deciding how to split a large problem into smaller sub-problems, and how to coalesce the split allocations into a unified allocation, must be done carefully and in a domain-aware way. We show common principles for splitting problems effectively across a variety of tasks, including cluster scheduling, traffic engineering, and load balancing.
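The split-solve-coalesce idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy single-resource allocation problem (a 0/1 knapsack, chosen only because it has a short exact solver), a uniformly random split of the items, and an equal share of the capacity for each sub-problem. The names `solve_knapsack` and `split_solve_coalesce` are hypothetical and exist only for this illustration.

```python
import random


def solve_knapsack(items, capacity):
    """Exact 0/1 knapsack by dynamic programming.

    items: list of (item_id, size, value) with positive integer sizes.
    Returns (best total value, tuple of chosen item ids).
    """
    # best[c] holds the best (value, chosen ids) using capacity at most c.
    best = [(0, ())] * (capacity + 1)
    for item_id, size, value in items:
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + (item_id,))
    return best[capacity]


def split_solve_coalesce(items, capacity, k, seed=0):
    """Split into k sub-problems, solve each exactly, coalesce the results.

    Assumption for this sketch: items are assigned to sub-problems
    uniformly at random, and each sub-problem gets capacity // k.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    groups = [shuffled[i::k] for i in range(k)]  # k disjoint subsets
    sub_capacity = capacity // k

    total_value, chosen = 0, []
    for group in groups:
        value, ids = solve_knapsack(group, sub_capacity)
        total_value += value
        chosen.extend(ids)  # coalesce: union of the sub-allocations
    return total_value, chosen


if __name__ == "__main__":
    rng = random.Random(1)
    items = [(i, rng.randint(1, 20), rng.randint(1, 100)) for i in range(60)]
    exact_value, _ = solve_knapsack(items, 200)
    split_value, _ = split_solve_coalesce(items, 200, k=4)
    print("exact:", exact_value, "split:", split_value)
```

Because each sub-problem uses at most `capacity // k` of the resource, the coalesced allocation is always feasible for the full problem, and its value can never exceed that of the exact solution. How close it comes depends on how the split is chosen, which is the domain-aware step the abstract refers to. In this toy, each sub-problem has roughly 1/k of the items and 1/k of the capacity, so the total dynamic-programming work shrinks by about a factor of k, and the sub-problems could also be solved in parallel.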