Recent multimodal large language models (MLLMs) have achieved impressive multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into the language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We term this paradigm LLMs for Vision because it employs LLMs for visual understanding and reasoning. However, we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, a complementary paradigm that can be regarded as Vision Enhancing LLMs. In this paper, we propose MKS2, an approach that enhances LLMs by enabling Multimodal Knowledge Storage and Sharing within them. Specifically, we introduce the Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs and designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts requiring physical or commonsense knowledge, and delivers competitive results on multimodal image-text understanding benchmarks. The code will be available at: https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing
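To make the soft MoMEs idea concrete, the sketch below shows a minimal soft (dense) mixture over two experts inside a transformer block: the block's original textual feed-forward expert and a second feed-forward expert standing in for the Modular Visual Memory. All module names, shapes, and the two-expert setup are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SoftMoME(nn.Module):
    """Minimal sketch of a soft Mixture of Multimodal Experts layer.

    Two experts are mixed per token with soft gating:
      - a textual FFN (the block's original feed-forward expert)
      - a visual-memory FFN standing in for the MVM expert
    Names and shapes are illustrative assumptions, not the paper's code.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.text_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.visual_expert = nn.Sequential(  # stand-in for the MVM expert
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.gate = nn.Linear(d_model, 2)  # soft router over the two experts

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        weights = torch.softmax(self.gate(hidden), dim=-1)        # (B, T, 2)
        expert_out = torch.stack(
            [self.text_expert(hidden), self.visual_expert(hidden)], dim=-1
        )                                                          # (B, T, D, 2)
        # Weighted sum of expert outputs per token (soft, not top-k, routing).
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)


if __name__ == "__main__":
    layer = SoftMoME(d_model=32, d_ff=64)
    x = torch.randn(2, 5, 32)
    print(layer(x).shape)  # torch.Size([2, 5, 32])
```

Because the gating is soft rather than top-k, every token receives a blended contribution from both the textual and the visual-memory expert, which is one plausible way to realize the "multimodal knowledge collaboration during text generation" described above.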