Multimodal Large Language Models (MLLMs) have demonstrated exceptional success in various multimodal tasks, yet their deployment is frequently limited by substantial computational demands and prolonged inference times. The vision modality typically carries richer information than the text modality, so its encoded representation comprises a large number of tokens, which incurs significant computational overhead due to the quadratic complexity of the attention mechanism. Current token reduction methods are typically restricted to specific model architectures and often necessitate extensive retraining or fine-tuning, limiting their applicability to many state-of-the-art models. In this paper, we introduce a learning-free token reduction (LFTR) method designed for MLLMs. LFTR can be seamlessly integrated into most open-source MLLM architectures without requiring additional fine-tuning. By capitalizing on the redundancy in visual representations, our approach effectively reduces tokens while preserving the general inference performance of MLLMs. We conduct experiments on multiple MLLM architectures (LLaVA, MiniGPT, QwenVL), and our results show that LFTR achieves up to a $16\times$ reduction of visual tokens while maintaining or even enhancing performance on mainstream visual question-answering benchmarks, all in a learning-free setting. Additionally, LFTR is complementary to other acceleration techniques, such as vision encoder compression and post-training quantization, further promoting the efficient deployment of MLLMs. Our project is available at https://anonymous.4open.science/r/LFTR-AAAI-0528.
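To make the general idea of learning-free, redundancy-based visual token reduction concrete, the following is a minimal sketch (not the paper's LFTR algorithm): it greedily drops visual tokens whose cosine similarity to an already-kept token exceeds a threshold, stopping once a target keep ratio is reached. The function name, threshold, and keep ratio are illustrative assumptions.

```python
# Hypothetical sketch of redundancy-based visual token pruning.
# This is NOT the LFTR method from the paper; it only illustrates
# learning-free token reduction that exploits visual redundancy.
import torch


def prune_redundant_tokens(visual_tokens: torch.Tensor,
                           keep_ratio: float = 1 / 16,
                           sim_threshold: float = 0.9) -> torch.Tensor:
    """Keep a subset of visual tokens, dropping near-duplicates.

    visual_tokens: [N, D] tensor of encoded visual tokens.
    keep_ratio:    target fraction of tokens to retain (1/16 ~ 16x reduction).
    sim_threshold: cosine similarity above which a token counts as redundant.
    """
    n_tokens = visual_tokens.size(0)
    target = max(1, int(n_tokens * keep_ratio))

    # Normalize once so dot products equal cosine similarities.
    normed = torch.nn.functional.normalize(visual_tokens, dim=-1)

    kept_idx = [0]  # always keep the first token
    for i in range(1, n_tokens):
        if len(kept_idx) >= target:
            break
        # Similarity of candidate token i to all tokens kept so far.
        sims = normed[i] @ normed[kept_idx].T
        if sims.max() < sim_threshold:
            kept_idx.append(i)

    return visual_tokens[kept_idx]


# Example: 576 CLIP-style visual tokens of dimension 1024.
tokens = torch.randn(576, 1024)
reduced = prune_redundant_tokens(tokens)
print(reduced.shape)  # at most ~36 tokens for the 16x reduction target
```

A sketch like this requires no training and operates purely on encoded features, which is why it can be dropped in front of most open-source MLLM language backbones; the actual selection criterion in LFTR is described in the paper.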