Visual Generation Tuning

Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT (Visual Generation Tuning), a novel paradigm designed to unlock the latent visual generation capabilities of any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly reduce alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we discard the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE by aligning the semantic encoders of pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art results among autoregressive models, with 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, VGT shows significant scaling promise and is versatile enough to endow any VLM trained for multimodal understanding with visual generation capabilities, which paves a new avenue for exploring next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.
