As large language models continue to grow in size, layer pruning has gained increasing attention as a hardware-friendly approach to model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited inference acceleration. To overcome these limitations, we propose \name, a task-\underline{E}ffective, training-\underline{E}conomical and inference-\underline{E}fficient layer pruning framework. \namespace introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments across diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, \namespace achieves 96\% accuracy on MATH-500 when pruning 25\% of the layers of Qwen3-32B, a mere 0.8\% drop from the original model (96.8\%), outperforming the existing SOTA (95\%), while delivering a 1.33$\times$ inference speedup and consuming merely 0.5B tokens (0.5\% of the post-training data volume).
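To make the first innovation concrete, the following is a minimal sketch of a straight-through Gumbel-TopK mask sampler of the kind the abstract refers to. It is not the authors' implementation: the function name \texttt{gumbel\_topk\_mask}, the temperature parameter \texttt{tau}, and the layer counts in the usage example are illustrative assumptions; only the general Gumbel-perturbed top-k relaxation with a straight-through gradient is the technique being sketched.

\begin{verbatim}
# Minimal sketch (assumed, not the paper's code) of a differentiable
# Gumbel-TopK layer mask: forward pass keeps exactly k layers, backward
# pass uses a softmax relaxation via the straight-through estimator.
import torch
import torch.nn.functional as F

def gumbel_topk_mask(logits: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    # Sample Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1).
    u = torch.rand_like(logits).clamp_min(1e-9)
    gumbel = -torch.log(-torch.log(u).clamp_min(1e-9))
    perturbed = (logits + gumbel) / tau

    # Soft scores carry the gradient in the backward pass.
    soft = F.softmax(perturbed, dim=-1)

    # Hard top-k selection is used in the forward pass.
    topk = perturbed.topk(k, dim=-1).indices
    hard = torch.zeros_like(soft).scatter_(-1, topk, 1.0)

    # Straight-through: forward value is hard, gradient flows through soft.
    return hard + (soft - soft.detach())

# Usage: search a keep-mask over 32 layers while pruning 25% (keep 24).
layer_logits = torch.zeros(32, requires_grad=True)
mask = gumbel_topk_mask(layer_logits, k=24)
\end{verbatim}

Under this kind of relaxation, the mask logits can be optimized jointly with the task loss, which is what makes the pruning mask search differentiable rather than combinatorial.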