Large language models (LLMs) significantly enhance the performance of various applications, but they are computationally intensive and energy-demanding. This makes it challenging to deploy them on devices with limited resources, such as personal computers and mobile/wearable devices, and results in substantial inference costs in resource-rich environments like cloud servers. To extend the use of LLMs, we introduce a low-rank decomposition approach that effectively compresses these models, tailored to the requirements of specific applications. We observe that LLMs pretrained on general datasets contain many redundant components that are not needed for particular applications. Our method focuses on identifying and removing these redundant parts, retaining only the elements necessary for the target applications. Specifically, we represent the weight matrices of LLMs as linear combinations of base components. We then prune the irrelevant bases and enhance the model with new bases beneficial for specific applications. Deep compression experiments on the Llama 2-7B and -13B models, targeting applications including mathematical reasoning and code generation, show that our method significantly reduces model size while maintaining accuracy comparable to state-of-the-art low-rank compression techniques.
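To make the core idea concrete, the sketch below is our own minimal illustration, not the paper's exact procedure: it decomposes a single weight matrix into rank-1 bases via SVD, scores each basis by how strongly it responds to calibration activations from the target application, prunes low-scoring bases, and optionally appends new trainable bases. The function name, the scoring heuristic, and the `keep_rank`/`new_rank` hyperparameters are assumptions made for illustration.

```python
# Minimal sketch of application-aware low-rank compression (illustrative only).
# Assumptions: SVD bases, an activation-energy importance score, and the
# hyperparameters `keep_rank` / `new_rank` are our own choices, not the paper's.

import torch

def compress_weight(W: torch.Tensor,
                    calib_inputs: torch.Tensor,
                    keep_rank: int,
                    new_rank: int = 0):
    """W: (out_dim, in_dim) weight matrix.
    calib_inputs: (n_samples, in_dim) activations collected while running
    calibration data from the target application."""
    # Represent W as a sum of rank-1 bases: W = sum_i s_i * u_i v_i^T
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # Score each basis by the energy it contributes on the calibration inputs:
    # bases that rarely fire for the target task receive low scores.
    proj = calib_inputs @ Vh.T          # (n_samples, r): v_i^T x for each basis
    scores = S * proj.norm(dim=0)       # (r,) task-aware importance

    # Keep only the bases relevant to the target application.
    keep = torch.topk(scores, keep_rank).indices
    A = U[:, keep] * S[keep]            # (out_dim, keep_rank)
    B = Vh[keep, :]                     # (keep_rank, in_dim)

    # Optionally append new, trainable bases to be learned on the target
    # application (analogous to enhancing the model with beneficial new bases).
    if new_rank > 0:
        A = torch.cat([A, torch.zeros(W.shape[0], new_rank)], dim=1)
        B = torch.cat([B, 0.01 * torch.randn(new_rank, W.shape[1])], dim=0)

    return A, B  # compressed weight: A @ B approximates W on the kept bases
```

Replacing a dense layer's weight `W` with the factors `A @ B` reduces both storage and matrix-multiply cost from `out_dim * in_dim` to `(out_dim + in_dim) * (keep_rank + new_rank)`, which is the source of the compression gains under any such low-rank scheme.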