Post-training model pruning is a promising solution, yet it faces a trade-off: simple heuristics that zero out weights are fast but degrade accuracy, while principled joint-optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT strike a practical middle ground, but sacrifice optimality by applying efficient, approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective, given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on accelerators. We implement an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot post-training pruning at scale on a single accelerator without fine-tuning. OPTIMA integrates with existing mask selectors and consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to a 3.97% absolute accuracy improvement. On an NVIDIA H100, OPTIMA prunes an 8B-parameter transformer end to end in 40 hours with 60 GB of peak memory. Together, these results set a new state-of-the-art accuracy-efficiency trade-off for one-shot post-training pruning.
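The row-wise QP described above has a closed form: with a shared layer Hessian H = XᵀX estimated from calibration inputs X, the optimal kept weights of each row satisfy H[M,M] w_M = H[M,:] w₀, where M is the row's kept support and w₀ its dense weights. A minimal NumPy sketch under these assumptions (names and shapes are illustrative, not OPTIMA's implementation; the per-row loop shown here is what OPTIMA batches on an accelerator):

```python
import numpy as np

def optimal_row_update(H, W0, masks):
    """Per-row optimal reconstruction given a shared layer Hessian H.

    For each output row r, solves the QP
        min_w (w - w0)^T H (w - w0)   s.t.  w[j] = 0 where masks[r, j] is False,
    whose closed form on the kept support M is
        w_M = H[M, M]^{-1} (H[M, :] @ w0).
    """
    W = np.zeros_like(W0)
    for r, keep in enumerate(masks):
        idx = np.flatnonzero(keep)
        if idx.size == 0:            # fully pruned row: stays zero
            continue
        H_mm = H[np.ix_(idx, idx)]   # Hessian restricted to kept columns
        rhs = H[idx] @ W0[r]         # H[M, :] w0
        W[r, idx] = np.linalg.solve(H_mm, rhs)
    return W

# Toy example (hypothetical sizes): calibration data, damped Hessian, random masks.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))          # calibration activations
H = X.T @ X + 1e-3 * np.eye(8)            # shared, damped layer Hessian
W0 = rng.standard_normal((4, 8))          # dense weight rows
masks = rng.random((4, 8)) > 0.5          # ~50% sparsity, chosen by any mask selector
W = optimal_row_update(H, W0, masks)
```

Because every row shares the same H, the per-row solves differ only in which sub-block of H they read, which is what makes batching many small solves on one accelerator natural; the optimal update is never worse than simply zeroing the masked weights under the same objective.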