Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, a lack of rigorous evaluation, and suboptimal outputs, manifesting as conceptual imbalance, superficial combinations, or mere juxtaposition. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, and optimize the policy via proximal policy optimization (PPO). At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate that RMLer outperforms existing methods in synthesizing coherent, high-fidelity objects from diverse categories. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.
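To make the core mechanism concrete, the sketch below illustrates the kind of MLP policy the abstract describes: it takes a pair of concept text embeddings (the RL state) and predicts per-dimension coefficients (the action) for blending them into a single fused embedding. This is a minimal illustration under assumed conventions, not the paper's implementation; the class name MixingPolicy, the CLIP-sized embed_dim of 768, the sigmoid parameterization of the coefficients, and the convex-combination blending rule are all assumptions for exposition.

```python
import torch
import torch.nn as nn


class MixingPolicy(nn.Module):
    """Hypothetical MLP policy (illustrative, not the paper's code):
    maps a pair of concept embeddings (the RL 'state') to per-dimension
    blending coefficients (the 'action') and returns the fused embedding."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
            nn.Sigmoid(),  # keep each coefficient in (0, 1)
        )

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        state = torch.cat([emb_a, emb_b], dim=-1)      # state: concatenated concept embeddings
        alpha = self.net(state)                        # action: dynamic per-dimension mixing coefficients
        return alpha * emb_a + (1.0 - alpha) * emb_b   # fused cross-category text embedding


# Toy usage: blend two CLIP-sized text embeddings; in the full pipeline the
# fused embedding would condition the T2I model, and visual rewards
# (semantic similarity, compositional balance) would drive PPO updates.
policy = MixingPolicy()
emb_a, emb_b = torch.randn(1, 768), torch.randn(1, 768)
fused = policy(emb_a, emb_b)
print(fused.shape)  # torch.Size([1, 768])
```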