Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and is limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages an MLLM's CoT reasoning across diverse generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent and the DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO) that alternates reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available at https://github.com/jiaosiyuu/ThinkGen
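To make the decoupled pipeline and the alternating SepGRPO schedule described above concrete, the following is a minimal, hypothetical Python sketch. All names here (MLLMPlanner, DiTGenerator, grpo_step, reward_fn, switch_every) are illustrative assumptions based solely on the abstract, not the released API or the authors' exact training recipe.

```python
# Hypothetical sketch of ThinkGen's decoupled MLLM + DiT pipeline and an
# alternating SepGRPO-style schedule. Placeholders only; not the official code.
from dataclasses import dataclass


@dataclass
class Sample:
    user_prompt: str            # raw user intent
    instruction: str = ""       # CoT-derived instruction produced by the MLLM
    image: object = None        # image produced by the DiT


class MLLMPlanner:
    def plan(self, user_prompt: str) -> str:
        """Run CoT reasoning over the user intent and emit a tailored
        generation instruction for the DiT (placeholder)."""
        return f"<instruction for: {user_prompt}>"


class DiTGenerator:
    def generate(self, instruction: str):
        """Synthesize an image conditioned on the MLLM's instruction (placeholder)."""
        return f"<image conditioned on: {instruction}>"


def sepgrpo_train(planner, generator, batches, reward_fn, grpo_step,
                  switch_every: int = 100):
    """Alternate GRPO-style RL updates between the MLLM and the DiT,
    updating only one module per phase while the other stays fixed."""
    train_mllm = True
    for step, batch in enumerate(batches):
        samples = []
        for prompt in batch:
            s = Sample(user_prompt=prompt)
            s.instruction = planner.plan(s.user_prompt)   # MLLM: intent -> instruction
            s.image = generator.generate(s.instruction)   # DiT: instruction -> image
            samples.append(s)
        rewards = [reward_fn(s) for s in samples]          # scenario-specific rewards
        target = planner if train_mllm else generator
        grpo_step(target, samples, rewards)                # policy update on one module
        if (step + 1) % switch_every == 0:                 # flip the trained module
            train_mllm = not train_mllm
```

The key design point the sketch illustrates is separability: because the MLLM and DiT are decoupled, each phase of reinforcement learning can draw on whichever datasets and reward signals suit that module, which is what enables joint training across diverse generation scenarios.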