As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design that tries to address the diverging architectural requirements of FP32 (or larger) based HPC and FP16 (or smaller) based DL workloads results in a sub-optimal configuration for either application domain. We argue that a Composable On-PAckage GPU (COPA-GPU) architecture providing domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain. We show how a COPA-GPU enables DL-specialized products through modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, and 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that, compared to a converged GPU design, a DL-optimized COPA-GPU featuring a combination of 16x larger cache capacity and 1.6x higher DRAM bandwidth improves per-GPU training and inference performance by 31% and 35%, respectively, and reduces the number of GPU instances by 50% in scale-out training scenarios.
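To make the opening claim concrete, the following is a minimal roofline-style sketch (in Python) of why scaling low-precision math throughput strains the memory system. All hardware parameters are hypothetical placeholders chosen only for illustration, not figures from this work.

```python
# Roofline-style break-even arithmetic intensity: the FLOPs/byte a kernel
# needs to stay compute-bound rather than memory-bound.
# All numbers below are hypothetical placeholders, not measurements.

def break_even_intensity(peak_flops: float, mem_bw_bytes: float) -> float:
    """FLOPs per byte required for peak math throughput to be reachable."""
    return peak_flops / mem_bw_bytes

DRAM_BW = 2.0e12        # bytes/s, shared by HPC and DL workloads (placeholder)
FP32_PEAK = 2.0e13      # FLOP/s, HPC-style FP32 math (placeholder)
FP16_PEAK = 3.2e14      # FLOP/s, DL-style FP16 tensor math (placeholder)

print(f"FP32 break-even intensity: {break_even_intensity(FP32_PEAK, DRAM_BW):.0f} FLOPs/byte")
print(f"FP16 break-even intensity: {break_even_intensity(FP16_PEAK, DRAM_BW):.0f} FLOPs/byte")
# With these placeholders, the FP16 break-even point is 16x higher than the
# FP32 one: DL-oriented designs need far more data reuse (a larger on-package
# cache) or far more bandwidth than HPC-oriented designs -- the divergence a
# COPA-GPU addresses with per-domain memory system specialization.
```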