Structured pruning of Generative Pre-trained Transformers (GPTs) offers a promising path to efficiency but often suffers from irreversible performance degradation due to the discarding of transformer blocks. In this paper, we introduce FuseGPT, a compression paradigm that reframes structured pruning as iterative knowledge grafting rather than simple removal. Motivated by the observation that linear block merging fails to capture non-linear feature disparities and that block importance fluctuates dynamically during pruning, FuseGPT employs a dual-strategy pipeline. First, we propose Macro Influence (MI), a dynamic fusion-aware metric that continuously re-evaluates block redundancy as the network topology evolves. Second, instead of rigid parameter averaging, we introduce a learnable low-rank fusion mechanism that adaptively grafts the knowledge of pruned blocks onto surviving layers via lightweight local distillation. Extensive experiments on LLaMA, Mistral, Qwen, and Phi families demonstrate that FuseGPT establishes a new state-of-the-art on the compression-accuracy Pareto frontier: at 25\% sparsity, FuseGPT achieves lower perplexity than prior methods at 20\% sparsity, improves zero-shot reasoning by up to 4.5 points, and delivers 1.33$\times$ inference speedup with 25\% memory reduction. Furthermore, FuseGPT is orthogonal to quantization, achieving 52.1\% total compression with negligible quality loss when combined with 4-bit GPTQ. We make our code publicly available at https://github.com/JarvisPei/FuseGPT.