GPU systems are increasingly powering modern datacenters at scale. Despite being highly performant, GPU systems suffer from performance variation at both the node and cluster levels. Such performance variation significantly impacts both high-performance computing and artificial intelligence workloads, such as cutting-edge large language models (LLMs). We analyze the performance of a single-node multi-GPU system running LLM training and observe that kernel-level performance variation is highly correlated with concurrent computation communication (C3), a technique that overlaps computation and communication across GPUs for performance gains. We then take a further step and reason that thermally induced straggling, coupled with C3, drives this performance variation, an effect we coin Lit Silicon. Lit Silicon describes how, in a multi-GPU node, thermal imbalance across GPUs introduces node-level straggler GPUs, which in turn slow down the leader GPUs. Lit Silicon leads to node-level performance variation and inefficiency, impacting the entire datacenter from the bottom up. We propose analytical performance and power models for Lit Silicon to understand the potential system-level gains. We further design simple detection and mitigation techniques that effectively address the Lit Silicon problem, and evaluate three power management solutions: power optimization under the GPU thermal design power, performance optimization under node-level GPU power capping, and performance optimization under node-level CPU power sloshing. We conduct experiments with two workloads on two AMD Instinct™ MI300X GPU systems under two LLM training frameworks, and observe up to 6% performance and 4% power improvements, which could translate to savings of hundreds of millions of dollars across datacenters. Our solution is almost a free lunch and can be effortlessly adopted in datacenters as a new node-level power management layer.
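To make the detection-and-mitigation idea concrete, the sketch below illustrates one plausible node-level control loop under the described assumptions; it is not the paper's actual implementation. The helper names (read_gpu_temps, read_gpu_power_caps, set_gpu_power_cap), the thresholds, and the node power budget are hypothetical placeholders; a real deployment would use vendor telemetry and power-cap tooling. The policy shown, shifting a small slice of power budget from the coolest (leader) GPU toward the hottest (straggler) GPU while holding the node total fixed, is only one illustrative reading of node-level power sloshing.

    # Minimal, hypothetical sketch of node-level Lit Silicon detection and mitigation.
    # Assumptions (not from the paper): per-GPU telemetry and power-cap controls are
    # exposed via the placeholder helpers below.

    from typing import List

    NODE_POWER_BUDGET_W = 6000.0   # assumed total GPU power budget for the node
    TEMP_IMBALANCE_C = 5.0         # assumed threshold for flagging a thermal straggler
    POWER_STEP_W = 25.0            # assumed per-adjustment power shift

    def read_gpu_temps() -> List[float]:
        """Placeholder: return per-GPU hotspot temperatures in Celsius."""
        raise NotImplementedError

    def read_gpu_power_caps() -> List[float]:
        """Placeholder: return current per-GPU power caps in watts."""
        raise NotImplementedError

    def set_gpu_power_cap(gpu_id: int, watts: float) -> None:
        """Placeholder: apply a new power cap to one GPU."""
        raise NotImplementedError

    def rebalance_once() -> None:
        temps = read_gpu_temps()
        caps = read_gpu_power_caps()
        hottest = max(range(len(temps)), key=lambda i: temps[i])
        coolest = min(range(len(temps)), key=lambda i: temps[i])

        # Act only when thermal imbalance is large enough to create a straggler.
        if temps[hottest] - temps[coolest] < TEMP_IMBALANCE_C:
            return

        # Slosh a small amount of budget from the coolest GPU to the hottest one,
        # keeping the node-level total within its budget.
        new_caps = list(caps)
        new_caps[coolest] -= POWER_STEP_W
        new_caps[hottest] += POWER_STEP_W
        if sum(new_caps) <= NODE_POWER_BUDGET_W:
            set_gpu_power_cap(coolest, new_caps[coolest])
            set_gpu_power_cap(hottest, new_caps[hottest])

Such a loop could run periodically on each node as a thin power management layer beneath the training framework, which is consistent with the abstract's claim that the mitigation requires no changes to the workload itself.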