General Matrix Multiplication (GEMM) is a critical kernel in high-performance computing and deep learning. While modern ISA extensions such as ARM's Scalable Matrix Extension (SME) introduce dedicated hardware for matrix operations, existing linear algebra libraries fail to fully exploit their potential, particularly for large matrices. This paper presents MpGEMM, an open-source library that leverages key architectural features of SME to optimize GEMM across multiple precisions. Through a systematic characterization of SME, we derive optimization guidelines that inform our design. MpGEMM employs cache-aware partitioning, efficient data packing with on-the-fly transposition, and specialized micro-kernels that utilize multi-vector loads and all available tile registers. Evaluated on an Apple M4 Pro with real-world workloads from DeepSeek and LLaMA, MpGEMM achieves an average speedup of 1.23x over the vendor-optimized Apple Accelerate library and significantly outperforms other open-source alternatives.
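The cache-aware partitioning mentioned above can be illustrated with a generic blocked GEMM loop nest. This is only a minimal sketch of the general technique, not MpGEMM's actual implementation: the block sizes `MC`/`KC`/`NC` are placeholders, and the scalar inner loops stand in for the SME micro-kernel and packing routines.

```c
#include <assert.h>
#include <string.h>

/* Illustrative block sizes; a real library tunes these to the
   cache hierarchy of the target core (placeholders here). */
#define MC 64
#define KC 64
#define NC 64

/* C += A * B, with A (m x k), B (k x n), C (m x n), row-major.
   The three outer loops partition the operands into cache-sized
   panels; the inner triple loop stands in for a micro-kernel. */
static void blocked_gemm(int m, int n, int k,
                         const float *A, const float *B, float *C) {
    for (int jc = 0; jc < n; jc += NC)
        for (int pc = 0; pc < k; pc += KC)
            for (int ic = 0; ic < m; ic += MC) {
                int nb = (n - jc < NC) ? n - jc : NC;
                int kb = (k - pc < KC) ? k - pc : KC;
                int mb = (m - ic < MC) ? m - ic : MC;
                for (int i = 0; i < mb; i++)
                    for (int p = 0; p < kb; p++) {
                        float a = A[(size_t)(ic + i) * k + (pc + p)];
                        for (int j = 0; j < nb; j++)
                            C[(size_t)(ic + i) * n + (jc + j)] +=
                                a * B[(size_t)(pc + p) * n + (jc + j)];
                    }
            }
}
```

In the real design, the panel of `B` selected by `jc`/`pc` and the panel of `A` selected by `ic`/`pc` would be packed into contiguous buffers (with transposition done on the fly) before the micro-kernel runs over them.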