Autoregressive decoding in LLMs is the major inference bottleneck due to memory-intensive operations and limited hardware bandwidth. The 3D-stacked architecture, which vertically stacks multiple DRAM dies on top of a logic die, is a promising solution that significantly improves memory bandwidth. However, our experiments also show that the 3D-stacked architecture faces more severe thermal issues than the 2D architecture, in terms of peak temperature, thermal gradient, and scalability. To better exploit the potential of the 3D-stacked architecture, we present Tasa, a heterogeneous architecture with cross-stack thermal optimizations that balances the temperature distribution and maximizes performance under thermal constraints. High-performance cores are designed for compute-intensive operators, while high-efficiency cores handle memory-intensive operators, e.g., attention layers. Furthermore, we propose a bandwidth-sharing scheduling scheme to improve bandwidth utilization in such a heterogeneous architecture. Extensive thermal experiments show that our Tasa architecture demonstrates greater scalability than the homogeneous 3D-stacked architecture, i.e., up to 5.55 $^\circ$C, 9.37 $^\circ$C, and 7.91 $^\circ$C peak temperature reduction for the 48-, 60-, and 72-core configurations, respectively. Our inference experiments with Llama-65B and GPT-3 66B also demonstrate 2.85x and 2.21x speedups over the GPU baseline and a state-of-the-art heterogeneous PIM-based LLM accelerator, respectively.
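The heterogeneous dispatch described above can be sketched as a simple policy that routes each operator by its arithmetic intensity: compute-intensive operators (e.g., FFN matmuls) go to high-performance cores, memory-intensive ones (e.g., attention) to high-efficiency cores near the DRAM stack. This is a minimal illustration only, not the paper's implementation; the `Op` fields, the `dispatch` function, and the intensity threshold are all assumptions for exposition.

```python
# Minimal sketch (NOT the Tasa implementation): route LLM operators to
# high-performance (HP) or high-efficiency (HE) cores by arithmetic
# intensity (FLOPs per byte of DRAM traffic). The threshold value is an
# illustrative assumption, not a figure from the paper.
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    flops: float        # total floating-point operations
    bytes_moved: float  # DRAM traffic in bytes


def arithmetic_intensity(op: Op) -> float:
    """FLOPs per byte: high for compute-bound ops, low for memory-bound ops."""
    return op.flops / op.bytes_moved


def dispatch(ops, threshold=10.0):
    """Assign compute-bound ops to HP cores and memory-bound ops to HE cores."""
    hp, he = [], []
    for op in ops:
        (hp if arithmetic_intensity(op) >= threshold else he).append(op.name)
    return hp, he


ops = [
    Op("ffn_matmul", flops=4e9, bytes_moved=1e8),  # compute-intensive: 40 FLOPs/byte
    Op("attention", flops=2e8, bytes_moved=1e8),   # memory-intensive: 2 FLOPs/byte
]
hp, he = dispatch(ops)
# hp == ["ffn_matmul"], he == ["attention"]
```

In a real system the intensity would be derived from layer shapes and batch size rather than fixed per-operator constants, and the scheduler would additionally account for the bandwidth-sharing mechanism the abstract mentions.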