This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries that bridge to a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline, consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning, that progressively builds multimodal comprehension and generation capabilities on top of existing vision-language models. To support instruction tuning, we further construct JavisInst-Omni, a high-quality instruction dataset of over 200K GPT-4o-curated audio-video-text dialogues spanning diverse, multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
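The encoder-LLM-decoder flow described above can be summarized with a minimal PyTorch-style sketch. All class names, dimensions, the cross-attention fusion design, and the number of learnable queries below are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hypothetical sketch of the encoder-LLM-decoder pipeline from the abstract.
# Interfaces and hyperparameters are assumptions, not JavisGPT's actual code.
import torch
import torch.nn as nn


class SyncFusion(nn.Module):
    """Fuses video and audio token streams (assumed cross-attention design)."""

    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, Tv, D), audio_tokens: (B, Ta, D)
        fused, _ = self.cross_attn(video_tokens, audio_tokens, audio_tokens)
        return self.norm(video_tokens + fused)  # temporally aligned JAV tokens


class JavisGPTSketch(nn.Module):
    """Encoders -> LLM -> synchrony-aware queries that condition a JAV-DiT (sketch only)."""

    def __init__(self, llm: nn.Module, dim: int = 1024, num_queries: int = 64):
        super().__init__()
        self.fusion = SyncFusion(dim)
        self.llm = llm  # placeholder for a pretrained vision-language backbone
        self.gen_queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable queries

    def forward(self, video_tokens, audio_tokens, text_embeds):
        jav_tokens = self.fusion(video_tokens, audio_tokens)
        batch = text_embeds.size(0)
        queries = self.gen_queries.unsqueeze(0).expand(batch, -1, -1)
        # Concatenate fused audio-video tokens, text, and generation queries into one sequence.
        hidden = self.llm(torch.cat([jav_tokens, text_embeds, queries], dim=1))
        # The hidden states of the query positions would condition the pretrained JAV-DiT generator.
        return hidden[:, -queries.size(1):]
```

As a usage illustration, `llm` could be any module mapping a `(B, L, D)` sequence to a `(B, L, D)` sequence (e.g., `nn.Identity()` for a shape check); the returned query states stand in for the conditioning signal passed to the generator.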