Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss at every denoising step. The proposed loss leverages cross-attention maps extracted at different levels of the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes, to the best of our knowledge, a new state of the art, achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
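To make the inference-time mechanism concrete, below is a minimal sketch (not the authors' released code) of loss-guided noise adjustment in a PyTorch latent-diffusion setup. The helper names `compound_spatial_loss` and the guidance scale `eta`, as well as the `return_attention=True` interface on the UNet, are assumptions for illustration; the abstract does not specify the exact form of the placement and presence terms.

```python
# Minimal sketch of training-free, inference-time spatial guidance via a
# compound loss on cross-attention maps. All interfaces below are assumed.
import torch

def compound_spatial_loss(attn_maps, layout_masks):
    """Hypothetical stand-in for the compound loss: penalize attention mass
    that falls outside each object's target region (placement) and encourage
    every object to receive some attention (presence)."""
    placement = sum((a * (1.0 - m)).mean() for a, m in zip(attn_maps, layout_masks))
    presence = sum((1.0 - a.amax(dim=(-2, -1))).mean() for a in attn_maps)
    return placement + presence

def guided_denoise_step(unet, scheduler, latents, t, text_emb, layout_masks, eta=0.1):
    """One denoising step in which the predicted noise is shifted along the
    negative gradient of the compound loss; model weights are never updated."""
    latents = latents.detach().requires_grad_(True)
    # Assumed interface: the UNet also returns cross-attention maps collected
    # from its decoder blocks during the forward pass.
    noise_pred, attn_maps = unet(
        latents, t, encoder_hidden_states=text_emb, return_attention=True
    )
    loss = compound_spatial_loss(attn_maps, layout_masks)
    grad = torch.autograd.grad(loss, latents)[0]
    # Adjust the noise prediction before the scheduler update (training-free).
    noise_pred = (noise_pred - eta * grad).detach()
    return scheduler.step(noise_pred, t, latents.detach()).prev_sample
```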