Learning cooperative multi-agent policies directly from high-dimensional, multimodal sensory inputs such as pixels and audio is notoriously sample-inefficient. Model-free multi-agent reinforcement learning (MARL) algorithms struggle with the joint challenges of representation learning, partial observability, and credit assignment. To address this, we propose a novel framework built around a shared, generative Multimodal World Model (MWM). The MWM learns a compressed latent representation of the environment's dynamics by fusing the distributed, multimodal observations of all agents through a scalable attention-based mechanism. We then leverage the learned MWM as a fast, "imagined" simulator, training cooperative MARL policies (e.g., MAPPO) entirely within its latent space and thereby decoupling representation learning from policy learning. We also introduce a set of challenging multimodal, multi-agent benchmarks built on a 3D physics simulator. Our experiments demonstrate that the MWM-MARL framework achieves orders-of-magnitude greater sample efficiency than state-of-the-art model-free MARL baselines. We further show that the proposed multimodal fusion is essential for task success in environments with sensory asymmetry, and that our architecture is substantially more robust to sensor dropout, a critical property for real-world deployment.
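To make the fusion step concrete, here is a minimal PyTorch sketch of one way such attention-based multimodal fusion could be realized: per-modality encoders map each agent's pixel and audio observations to tokens, and a single learned query attends over all agents' tokens to produce a joint latent for the world model. Every module name (`MultimodalFusion`, `pixel_enc`, `audio_enc`), architecture choice, and dimension below is an illustrative assumption, not the paper's implementation.

```python
# A minimal sketch (illustrative assumptions, not the authors' code) of
# attention-based fusion of all agents' multimodal observations.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        # Per-modality encoders (placeholders for real CNN / audio networks).
        self.pixel_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_model))
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_model))
        # One learned query attends over all agents' tokens, so the fusion
        # cost grows linearly with the number of agents.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, pixels, audio):
        # pixels: (batch, n_agents, C, H, W); audio: (batch, n_agents, T)
        b, n = pixels.shape[:2]
        px = self.pixel_enc(pixels.flatten(0, 1)).view(b, n, -1)
        au = self.audio_enc(audio.flatten(0, 1)).view(b, n, -1)
        tokens = torch.cat([px, au], dim=1)      # (b, 2n, d_model)
        q = self.query.expand(b, -1, -1)         # (b, 1, d_model)
        fused, _ = self.attn(q, tokens, tokens)  # (b, 1, d_model)
        return fused.squeeze(1)                  # joint latent for the MWM

fusion = MultimodalFusion()
z = fusion(torch.randn(2, 3, 3, 32, 32), torch.randn(2, 3, 256))
print(z.shape)  # torch.Size([2, 128])
```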
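The "imagination" training loop can likewise be sketched in a few lines: starting from latent states, a hypothetical latent transition model (`dynamics`) and reward predictor (`reward_head`) generate rollouts with no real environment steps, and a simple REINFORCE-style loss stands in for the full MAPPO update the paper uses. All names and hyperparameters here are assumptions for illustration.

```python
# A minimal sketch (illustrative, not the paper's code) of policy training
# on imagined latent rollouts from a learned world model.
import torch
import torch.nn as nn

latent_dim, action_dim, horizon, gamma = 128, 4, 15, 0.99
dynamics = nn.GRUCell(action_dim, latent_dim)      # hypothetical latent transition model
reward_head = nn.Linear(latent_dim, 1)             # hypothetical learned reward predictor
policy = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(),
                       nn.Linear(64, action_dim))  # stand-in actor producing logits

z = torch.randn(32, latent_dim)                    # start latents, e.g. from a replay buffer
log_probs, rewards = [], []
for _ in range(horizon):                           # imagined rollout: no real env steps
    dist = torch.distributions.Categorical(logits=policy(z))
    a = dist.sample()
    log_probs.append(dist.log_prob(a))
    z = dynamics(nn.functional.one_hot(a, action_dim).float(), z)
    rewards.append(reward_head(z).squeeze(-1))     # predicted, not real, rewards

# Discounted imagined return, then a vanilla REINFORCE policy loss
# (a simplified stand-in for the MAPPO update described in the abstract).
ret = torch.zeros(32)
for r in reversed(rewards):
    ret = r.detach() + gamma * ret
loss = -(torch.stack(log_probs).sum(0) * ret).mean()
loss.backward()
```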