Binary quantization, which replaces weight matrices with binary matrices and substitutes costly multiplications with cheaper additions, offers a computationally efficient way to address the growing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by a scaling vector. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. In the 1-bit-per-weight range, DBF outperforms existing binarization approaches; in the 2-bit range, it is competitive with the best quantization methods such as QuIP\# and QTIP. Unlike most existing compression techniques, which offer only a limited choice of compression levels, DBF allows fine-grained control over the compression ratio by adjusting the factorization's intermediate dimension. Building on this advantage, we further introduce an algorithm that estimates non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. Code is available at: https://github.com/usamec/double_binary
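The structure of the factorization can be illustrated with a minimal NumPy sketch. The placement of the scaling vectors (per output row and per input column) and the specific shapes below are illustrative assumptions, not the paper's exact parameterization or fitting algorithm; the sketch only shows how the intermediate dimension `k` controls the effective bits per weight.

```python
import numpy as np

rng = np.random.default_rng(0)

m, k, n = 64, 48, 64  # k is the intermediate dimension; varying it tunes compression

# Two binary (sign) factors -- in practice these would be fitted to a dense W.
B1 = np.sign(rng.standard_normal((m, k)))  # entries in {-1, +1}
B2 = np.sign(rng.standard_normal((k, n)))  # entries in {-1, +1}

# Scaling vectors accompanying each factor (placement here is an assumption).
a = rng.random(m) + 0.1  # per-row scale for the first factor
b = rng.random(n) + 0.1  # per-column scale for the second factor

# Reconstructed dense weight matrix: diag(a) @ B1 @ B2 @ diag(b).
W_hat = (a[:, None] * B1) @ (B2 * b[None, :])

# Storage cost of the sign bits: one bit per entry of B1 and B2,
# amortized over the m*n weights they replace.
bits_per_weight = (m * k + k * n) / (m * n)  # = 2k/n when m == n
print(W_hat.shape, bits_per_weight)
```

With `m = n`, the sign-bit cost is roughly `2k/n` bits per weight (plus the small overhead of the scaling vectors), so sweeping `k` yields the fine-grained compression ratios described above.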