Recent multimodal large language models (MLLMs) have achieved impressive multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into the language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We term this paradigm LLMs for Vision because it employs LLMs for visual understanding and reasoning. However, we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, a complementary paradigm that can be regarded as Vision Enhancing LLMs. In this paper, we propose MKS2, an approach that enhances LLMs by enabling Multimodal Knowledge Storage and Sharing within them. Specifically, we introduce the Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs and designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts requiring physical or commonsense knowledge, and delivers competitive results on multimodal image-text understanding benchmarks. The code will be available at: https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing
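To make the soft MoMEs idea concrete, the sketch below shows a minimal soft (dense) mixture over two experts inside a transformer block: the block's original textual feed-forward expert and a second feed-forward expert standing in for the Modular Visual Memory. All module names, shapes, and the two-expert setup are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SoftMoME(nn.Module):
    """Minimal sketch of a soft Mixture of Multimodal Experts layer.

    Two experts are mixed per token with soft gating:
      - a textual FFN (the block's original feed-forward expert)
      - a visual-memory FFN standing in for the MVM expert
    Names and shapes are illustrative assumptions, not the paper's code.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.text_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.visual_expert = nn.Sequential(  # stand-in for the MVM expert
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.gate = nn.Linear(d_model, 2)  # soft router over the two experts

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        weights = torch.softmax(self.gate(hidden), dim=-1)        # (B, T, 2)
        expert_out = torch.stack(
            [self.text_expert(hidden), self.visual_expert(hidden)], dim=-1
        )                                                          # (B, T, D, 2)
        # Weighted sum of expert outputs per token (soft, not top-k, routing).
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)


if __name__ == "__main__":
    layer = SoftMoME(d_model=32, d_ff=64)
    x = torch.randn(2, 5, 32)
    print(layer(x).shape)  # torch.Size([2, 5, 32])
```

Because the gating is soft rather than top-k, every token receives a blended contribution from both the textual and the visual-memory expert, which is one plausible way to realize the "multimodal knowledge collaboration during text generation" described above.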