Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers. Project page at https://compvis.github.io/taming-transformers/ .
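The two-stage idea ((i) a CNN-learned discrete vocabulary of image constituents, (ii) a transformer over their composition) hinges on quantizing continuous encoder features to the nearest entry of a learned codebook, turning an image into a sequence of discrete tokens. The sketch below is a minimal, hypothetical illustration of that quantization step only; the function and variable names are ours, not the paper's, and the codebook here is random rather than learned.

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its
    nearest codebook entry (its discrete "token").

    features: (N, D) array of CNN encoder outputs
    codebook: (K, D) array, the learned vocabulary of constituents
    returns:  (N,) array of token indices in [0, K)
    """
    # Pairwise squared distances via broadcasting: (N, K)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

# Toy demonstration with a random (not learned) codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))          # K=16 entries, D=4 dims
# Features near entries 3, 7, 1, perturbed by small noise.
features = codebook[[3, 7, 1]] + 0.01 * rng.normal(size=(3, 4))
tokens = quantize(features, codebook)
print(tokens.tolist())  # recovers [3, 7, 1]
```

A transformer would then model the resulting token sequence autoregressively, predicting each token from the ones before it, so that sampling token-by-token and decoding through the CNN yields a synthesized image.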