Recently, the introduction of Chain-of-Thought (CoT) has substantially improved the generation ability of unified models. However, we observe that the current reasoning process during generation focuses mainly on textual consistency with the text prompt, while ignoring \textbf{visual context consistency} with the visual reference images in multi-modal generation, e.g., multi-reference generation. The lack of such consistency leads to failures in preserving key visual features (e.g., human identity, object attributes, and style). To this end, we integrate visual context consistency into the reasoning of unified models, explicitly encouraging the model to maintain such consistency through 1) Adaptive Visual Planning: generating a structured visual checklist that identifies the visual elements whose consistency must be preserved, and 2) Iterative Visual Correction: performing self-reflection guided by the checklist and refining the generated result in an iterative manner. To achieve this, we use supervised fine-tuning to teach the model to plan the visual checks, conduct self-reflection, and perform self-refinement, and we apply flow-GRPO to further enhance visual consistency through a customized visual checking reward. Experiments show that our method outperforms both zero-shot unified models and those equipped with text-only CoT in multi-modal generation, demonstrating higher visual context consistency.
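To make the plan-check-refine procedure described above concrete, the following is a minimal sketch of the Adaptive Visual Planning and Iterative Visual Correction loop. All function names (plan_visual_checklist, generate_image, self_reflect, refine_image) and the max_rounds budget are hypothetical placeholders for illustration, not the paper's actual interface.

\begin{verbatim}
def generate_with_visual_consistency(model, prompt, reference_images,
                                     max_rounds=3):
    # 1) Adaptive Visual Planning: derive a structured checklist of visual
    #    elements (e.g., identity, attributes, style) that must remain
    #    consistent with the reference images.
    checklist = model.plan_visual_checklist(prompt, reference_images)

    image = model.generate_image(prompt, reference_images, checklist)

    # 2) Iterative Visual Correction: self-reflect against the checklist
    #    and refine the result until every item passes or the refinement
    #    budget is exhausted.
    for _ in range(max_rounds):
        feedback = model.self_reflect(image, checklist, reference_images)
        if all(item["passed"] for item in feedback):
            break
        image = model.refine_image(image, prompt, reference_images, feedback)

    return image
\end{verbatim}

In the described training recipe, supervised fine-tuning would teach the model the planning, reflection, and refinement behaviors, while flow-GRPO would optimize a checklist-based visual checking reward on top of this loop.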