Large Language Models (LLMs) demand substantial computational resources, resulting in high energy consumption on GPUs. To address this challenge, we focus on Coarse-Grained Reconfigurable Arrays (CGRAs) as an effective alternative that balances energy efficiency and programmability. This paper presents the first comprehensive, end-to-end evaluation of a non-AI-specialized Coarse-Grained Linear Array (CGLA) accelerator for the state-of-the-art Qwen LLM family. The architecture is general-purpose and task-agnostic, yet its flexible instruction set permits domain-specific adaptations, enabling high efficiency for sustainable LLM inference. We assess the performance of our architecture on an FPGA prototype using the widely adopted llama.cpp framework. We then project its potential as a 28 nm ASIC and compare it against a high-performance GPU (NVIDIA RTX 4090) and an edge AI device (NVIDIA Jetson AGX Orin). While the GPUs achieve lower latency, our non-AI-specialized accelerator delivers higher energy efficiency, improving the Power-Delay Product (PDP) by up to 44.4x and 13.6x over the RTX 4090 and the Jetson, respectively. Similarly, it reduces the Energy-Delay Product (EDP) by up to 11.5x relative to the high-performance GPU, demonstrating a favorable performance-energy trade-off. Critically, our system-level analysis identifies host-accelerator data transfer as the primary performance bottleneck, a factor often overlooked in kernel-level studies. These findings provide design guidance for next-generation LLM accelerators and validate CGRAs as a suitable platform for LLM inference in power-constrained environments, without confining the design to specific algorithms.
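For reference, the two figures of merit quoted above follow their standard definitions; a minimal sketch in LaTeX, assuming the conventional forms (the paper's exact measurement setup is not restated here):

% Standard figure-of-merit definitions (conventional forms assumed).
% P: average power, D: end-to-end inference delay, E = P * D: energy per task.
\begin{align}
  \mathrm{PDP} &= P \cdot D, \\               % power-delay product (= energy per task)
  \mathrm{EDP} &= E \cdot D = P \cdot D^{2}   % energy-delay product
\end{align}

Under these definitions, PDP equals the energy consumed per inference, so the reported 44.4x PDP improvement corresponds to a 44.4x reduction in energy per task, while EDP weights delay quadratically and thus rewards designs that are both fast and frugal.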