Multimodal Large Language Models (MLLMs) deliver strong vision-language performance but at high computational cost, driven by the large number of visual tokens processed by the Vision Transformer (ViT) encoder. Existing token pruning strategies are inadequate: LLM-stage token pruning overlooks the ViT's overhead, while conventional ViT token pruning, lacking language guidance, risks discarding textually critical visual cues and introduces feature distortions that are amplified by the ViT's bidirectional attention. To address these challenges, we propose IPCV, a training-free, information-preserving compression framework for MLLM visual encoders. IPCV enables aggressive token pruning inside the ViT via Neighbor-Guided Reconstruction (NGR), which temporarily reconstructs pruned tokens so they can participate in attention at minimal overhead, and then fully restores them before they are passed to the LLM. In addition, we introduce Attention Stabilization (AS) to further mitigate the negative effects of token pruning by approximating the K/V of pruned tokens. AS can also be applied directly to existing LLM-side token pruning methods to improve their performance. Extensive experiments show that IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods across diverse image and video benchmarks. Our code is available at https://github.com/Perkzi/IPCV.
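For intuition, the sketch below shows one possible way to realize the neighbor-guided reconstruction idea inside a ViT block: pruned tokens temporarily borrow the features of their most similar kept token so that every position can still take part in bidirectional attention cheaply, and the original features are restored afterwards before the tokens reach the LLM. This is a minimal illustrative sketch, not the paper's implementation; the function name, the cosine-similarity neighbor assignment, and the `keep_ratio` parameter are all assumptions.

```python
# Illustrative sketch only: names, the cosine-similarity neighbor assignment,
# and keep_ratio are assumptions, not the authors' exact method.
import torch
import torch.nn.functional as F

def neighbor_guided_reconstruction(tokens, scores, keep_ratio=0.5):
    """tokens: (B, N, D) ViT token features; scores: (B, N) importance scores.
    Returns (B, N, D) features in which pruned tokens are temporarily replaced
    by their most similar kept token, plus the boolean keep mask."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    keep_idx = scores.topk(k, dim=1).indices                                      # (B, k)
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))      # (B, k, D)

    # Assign every token to its most similar kept token (cosine similarity).
    sim = F.normalize(tokens, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)  # (B, N, k)
    nn_idx = sim.argmax(dim=-1)                                                    # (B, N)
    reconstructed = torch.gather(kept, 1, nn_idx.unsqueeze(-1).expand(-1, -1, D))  # (B, N, D)

    # Kept positions retain their exact features; pruned positions use the copy.
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, True)
    return torch.where(mask.unsqueeze(-1), tokens, reconstructed), mask

# Usage (conceptual): run the ViT block's attention on the reconstructed
# sequence, then restore the original features at pruned positions before
# handing all N tokens to the LLM, as the abstract describes.
```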