Modern AI relies on huge matrix multiplications (MatMuls), whose computation poses a scalability problem for inference and training. We propose a GPU-native bilinear operator as an alternative to MatMuls in neural networks, offering a three-way tradeoff among speed, accuracy, and parameter count. In particular, this operator requires substantially fewer FLOPs to evaluate ($\ll n^3$), at the cost of a larger parameter count than MatMul ($\gg n^2$). We call this operator Strassen-Tile (STL). The key idea behind STL is a local learnable change-of-basis, applied to tiles of the weight and activation matrices, followed by an element-wise product between the tiles, implemented simultaneously via MatMul. The key technical question we study is how to optimize the change-of-basis of a given layer, which is a highly non-convex problem. We show that theory-backed initializations (inspired by fast matrix and polynomial multiplication) lead to substantially better accuracy than random SGD initialization. This phenomenon motivates further algorithmic study of STL optimization in DNNs. Our experiments demonstrate that STL can approximate 4x4 MatMul of tiles while reducing FLOPs by a factor of 2.66, and can improve the ImageNet-1K accuracy of the SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-CUDA-optimized PyTorch code, STL achieves wall-clock speedups in the compute-bound regime. These results, together with its theoretical grounding, suggest STL as a promising building block for scalable and cost-efficient AI.
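To make the operator concrete, the following is a minimal NumPy sketch of the STL structure described above: tiles of the input matrices are mapped by learnable encoders into a coefficient basis, multiplied element-wise there, and decoded back. The encoder/decoder matrices `U`, `V`, `Wd` and the function name `stl_matmul` are illustrative, not the paper's implementation; as an example of a theory-backed initialization, they are set here to Strassen's rank-7 tensors for 2x2 tiles, for which the operator reproduces MatMul exactly.

```python
import numpy as np

t, r = 2, 7  # 2x2 tiles, rank-7 decomposition (Strassen's algorithm)

# Theory-backed initialization: Strassen's rank-7 tensors for 2x2 MatMul,
# with row-major tile vectorization [x11, x12, x21, x22]. In STL these
# three matrices are the learnable change-of-basis, trained by SGD.
U = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]], float)   # encodes A-tiles
V = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
              [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]], float)     # encodes B-tiles
Wd = np.array([[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
               [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]], float)  # decodes products

def stl_matmul(A, B):
    """C_ij = Decode( sum_k Enc(A_ik) * Enc(B_kj) ): an element-wise
    product of r coefficients replaces each t^3-multiplication tile product."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, t):
        for j in range(0, n, t):
            acc = np.zeros(r)
            for k in range(0, n, t):
                a = A[i:i+t, k:k+t].reshape(-1)   # vectorize activation tile
                b = B[k:k+t, j:j+t].reshape(-1)   # vectorize weight tile
                acc += (U @ a) * (V @ b)          # element-wise product in the new basis
            C[i:i+t, j:j+t] = (Wd @ acc).reshape(t, t)
    return C

A = np.arange(16.).reshape(4, 4)
B = np.ones((4, 4)) - np.eye(4)
print(np.allclose(stl_matmul(A, B), A @ B))  # True: exact with Strassen init
```

In the learned setting, `U`, `V`, and `Wd` need not encode an exact algorithm: choosing a smaller rank `r` trades accuracy for FLOPs, and in practice all tile encodings across the matrix can be batched into a single MatMul rather than looped as here.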