Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and is limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages an MLLM's CoT reasoning across diverse generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent and the DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO) that alternates reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available at https://github.com/jiaosiyuu/ThinkGen
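To make the decoupled pipeline and the alternating SepGRPO schedule described above concrete, the following is a minimal, hypothetical Python sketch. All names here (MLLMPlanner, DiTGenerator, grpo_step, reward_fn, switch_every) are illustrative assumptions based solely on the abstract, not the released API or the authors' exact training recipe.

```python
# Hypothetical sketch of ThinkGen's decoupled MLLM + DiT pipeline and an
# alternating SepGRPO-style schedule. Placeholders only; not the official code.
from dataclasses import dataclass


@dataclass
class Sample:
    user_prompt: str            # raw user intent
    instruction: str = ""       # CoT-derived instruction produced by the MLLM
    image: object = None        # image produced by the DiT


class MLLMPlanner:
    def plan(self, user_prompt: str) -> str:
        """Run CoT reasoning over the user intent and emit a tailored
        generation instruction for the DiT (placeholder)."""
        return f"<instruction for: {user_prompt}>"


class DiTGenerator:
    def generate(self, instruction: str):
        """Synthesize an image conditioned on the MLLM's instruction (placeholder)."""
        return f"<image conditioned on: {instruction}>"


def sepgrpo_train(planner, generator, batches, reward_fn, grpo_step,
                  switch_every: int = 100):
    """Alternate GRPO-style RL updates between the MLLM and the DiT,
    updating only one module per phase while the other stays fixed."""
    train_mllm = True
    for step, batch in enumerate(batches):
        samples = []
        for prompt in batch:
            s = Sample(user_prompt=prompt)
            s.instruction = planner.plan(s.user_prompt)   # MLLM: intent -> instruction
            s.image = generator.generate(s.instruction)   # DiT: instruction -> image
            samples.append(s)
        rewards = [reward_fn(s) for s in samples]          # scenario-specific rewards
        target = planner if train_mllm else generator
        grpo_step(target, samples, rewards)                # policy update on one module
        if (step + 1) % switch_every == 0:                 # flip the trained module
            train_mllm = not train_mllm
```

The key design point the sketch illustrates is separability: because the MLLM and DiT are decoupled, each phase of reinforcement learning can draw on whichever datasets and reward signals suit that module, which is what enables joint training across diverse generation scenarios.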