Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identities of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism, which incorporates the composite layout image into the generation context to robustly anchor objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure identity consistency across multiple instances. Recognizing the lack of large-scale, hierarchically structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state of the art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.