High-performance computing (HPC) clusters consume enormous amounts of energy, and idle nodes are a major source of waste. Powering down unused nodes mitigates this problem, but poorly timed power-state transitions introduce long delays and reduce overall performance. To address this trade-off, we present SPARS, a reinforcement learning-enabled simulator for power management in HPC job scheduling. SPARS integrates job scheduling and node power-state management within a discrete-event simulation framework. It supports traditional scheduling policies such as First Come First Served (FCFS) and EASY Backfilling, along with enhanced variants in which reinforcement learning agents dynamically decide when nodes should be powered on or off. Users configure workloads and platforms in JSON format, specifying job arrivals, execution times, node power models, and transition delays. The simulator records comprehensive metrics, including energy usage, wasted power, job waiting times, and node utilization, and provides Gantt chart visualizations for analyzing scheduling dynamics and power transitions. Unlike widely used Batsim-based frameworks, which rely on heavy inter-process communication, SPARS provides lightweight event handling and consistent simulation results, making experiments easier to reproduce and extend. Its modular design allows new scheduling heuristics or learning algorithms to be integrated with minimal effort. By providing a flexible, reproducible, and extensible platform, SPARS enables researchers and practitioners to systematically evaluate power-aware scheduling strategies, explore the trade-offs between energy efficiency and performance, and accelerate the development of sustainable HPC operations.
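The trade-off between energy savings and waiting time described in the abstract can be illustrated with a minimal sketch. It is not SPARS code and does not use the SPARS API or configuration schema: the JSON field names, the power and delay values, the helper names `idle_gap` and `simulate`, and the restriction to single-node jobs are all assumptions made for illustration. A fixed idle timeout stands in for the reinforcement learning agent that decides when a node is powered off.

```python
import json

# Hypothetical configuration: field names and values are illustrative only,
# not the actual SPARS JSON schema.
PLATFORM = json.loads("""
{"nodes": 2,
 "power_w": {"active": 200.0, "idle": 100.0, "sleep": 10.0,
             "switching_on": 150.0, "switching_off": 120.0},
 "delay_s": {"switching_on": 30.0, "switching_off": 10.0}}
""")
WORKLOAD = json.loads("""
{"jobs": [{"id": 1, "submit": 0.0, "runtime": 100.0},
          {"id": 2, "submit": 0.0, "runtime": 50.0},
          {"id": 3, "submit": 500.0, "runtime": 100.0}]}
""")


def idle_gap(free_at, until, timeout, platform, wake):
    """Return (energy in J, ready time) for one node between free_at and until.

    The node idles for `timeout` seconds, then switches off and sleeps.
    If `wake` is true, the node is booted for a job requested at `until`.
    """
    p, d = platform["power_w"], platform["delay_s"]
    if until <= free_at:                      # node is still busy: no gap
        return 0.0, free_at
    if until - free_at <= timeout:            # node stays on and idle
        return (until - free_at) * p["idle"], until
    off_start = free_at + timeout
    off_done = off_start + d["switching_off"]
    energy = timeout * p["idle"]
    energy += max(0.0, until - off_done) * p["sleep"]
    if not wake:                              # end of run: transition may be partial
        energy += (min(until, off_done) - off_start) * p["switching_off"]
        return energy, until
    # A requested node finishes switching off, then pays the full boot delay.
    energy += d["switching_off"] * p["switching_off"]
    energy += d["switching_on"] * p["switching_on"]
    return energy, max(until, off_done) + d["switching_on"]


def simulate(platform, workload, timeout):
    """FCFS over single-node jobs with an idle-timeout power-off policy."""
    free_at = [0.0] * platform["nodes"]       # all nodes start on and idle
    energy, total_wait, clock = 0.0, 0.0, 0.0
    jobs = sorted(workload["jobs"], key=lambda j: j["submit"])
    for job in jobs:
        # FCFS: a job is considered no earlier than the previous job's start.
        clock = max(clock, job["submit"])
        options = [idle_gap(f, clock, timeout, platform, wake=True) for f in free_at]
        node = min(range(len(free_at)), key=lambda i: options[i][1])
        gap_energy, start = options[node]
        energy += gap_energy + job["runtime"] * platform["power_w"]["active"]
        total_wait += start - job["submit"]
        free_at[node] = start + job["runtime"]
        clock = start
    makespan = max(free_at)
    for f in free_at:                         # account for trailing idle or sleep time
        energy += idle_gap(f, makespan, timeout, platform, wake=False)[0]
    return {"energy_j": energy, "mean_wait_s": total_wait / len(jobs),
            "makespan_s": makespan}


always_on = simulate(PLATFORM, WORKLOAD, timeout=float("inf"))
timeout_60 = simulate(PLATFORM, WORKLOAD, timeout=60.0)
print(always_on)
print(timeout_60)
```

With these made-up numbers, the always-on run uses 145000 J with no waiting, while the 60-second timeout uses 77300 J at the cost of a 30-second boot delay for the third job (mean wait 10 s, makespan 630 s instead of 600 s). A learned policy would replace the fixed timeout with a per-decision choice, which is the kind of strategy the simulator is meant to evaluate.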