Quantization to low bit-widths is a standard approach for deploying large language models; however, a few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer. A common mitigation is to apply a fixed orthogonal transform, such as a Hadamard matrix, before quantization, which typically reduces the dynamic range. Yet these transforms ignore the statistics of the data, and their optimality is not yet understood. In this work, we derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization with standard data-free quantizers for common numerical formats. Specifically, we derive the optimal adaptive (data-aware) transforms for round-to-nearest (RTN), AbsMax-scaled block quantizers in both integer and floating-point formats. The resulting construction, which we call WUSH, combines a Hadamard backbone with a data-dependent component based on second-order moments, yielding a non-orthogonal transform that is provably optimal under mild assumptions and remains structured for efficient implementation. Preliminary experiments show that our approach consistently improves upon the Hadamard transform for common formats.
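To make the setting concrete, the following is a minimal sketch (not the paper's implementation) of an AbsMax-scaled RTN integer block quantizer with an optional blockwise linear transform, using a normalized Hadamard matrix as the fixed orthogonal baseline that WUSH builds on. The function names, block size, bit-width, and synthetic outlier data are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import hadamard  # power-of-two sizes only


def absmax_rtn_int_quant(x, bits=4):
    """AbsMax-scaled round-to-nearest (RTN) integer quantizer for one block.

    The block is scaled so its largest magnitude maps to the edge of the
    signed integer grid, rounded to the nearest level, then rescaled back.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for signed INT4
    scale = np.max(np.abs(x)) / qmax           # AbsMax scaling factor
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized block


def blockwise_quant(x, block_size=64, bits=4, transform=None):
    """Quantize a vector block by block, optionally applying a linear
    transform T to each block first and its inverse afterwards."""
    x = x.reshape(-1, block_size)
    if transform is not None:
        t_inv = np.linalg.inv(transform)
        x = x @ transform.T
    xq = np.stack([absmax_rtn_int_quant(b, bits) for b in x])
    if transform is not None:
        xq = xq @ t_inv.T
    return xq.reshape(-1)


# Hadamard backbone: an orthogonal transform that spreads out extreme
# values and shrinks each block's dynamic range before quantization.
block = 64
H = hadamard(block) / np.sqrt(block)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
w[rng.integers(0, w.size, 8)] *= 25.0          # inject a few outlier weights

err_plain = np.mean((w - blockwise_quant(w, block, 4)) ** 2)
err_had = np.mean((w - blockwise_quant(w, block, 4, transform=H)) ** 2)
print(f"RTN MSE without transform: {err_plain:.5f}, with Hadamard: {err_had:.5f}")
```

A data-aware transform such as WUSH would replace `H` above with a non-orthogonal matrix built from the Hadamard backbone and second-order statistics of the weights and activations; the closed-form construction is given in the paper.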