We present Spanning Tree Autoregressive (STAR) modeling, which incorporates prior knowledge about images, such as center bias and locality, to maintain sampling performance while keeping the sequence order flexible enough to accommodate image editing at inference. Existing approaches that expose conventional autoregressive (AR) models for visual generation to randomly permuted sequence orders, in order to provide bidirectional context, either suffer a decline in performance or compromise the flexibility of sequence order choice at inference. Instead, STAR uses traversal orders of uniform spanning trees sampled on a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, and rejection sampling lets us efficiently construct a spanning tree whose traversal order places a connected partial observation of the image as a prefix of the sequence. Because this randomization is tailored and structured, unlike random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance, without any significant changes to the model architecture widely adopted in language AR modeling.
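To make the ordering procedure concrete, the following is a minimal sketch of how a spanning-tree traversal order with an observed prefix might be sampled. It is not the authors' implementation. Several choices are assumptions made for illustration only: the use of Wilson's algorithm (loop-erased random walks) to draw the uniform spanning tree, the 4-connected patch lattice, rooting the tree at an arbitrary observed patch, shuffling sibling order during breadth-first search, and all function names (`sample_uniform_spanning_tree`, `bfs_order`, `sample_order_with_prefix`) are hypothetical.

```python
import random
from collections import deque


def neighbors(cell, h, w):
    """Yield the 4-connected lattice neighbors of a patch position (assumed connectivity)."""
    r, c = cell
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < h and 0 <= nc < w:
            yield (nr, nc)


def sample_uniform_spanning_tree(h, w, root, rng):
    """Wilson's algorithm (an assumed choice): returns a map from each cell to its parent."""
    parent = {root: None}
    for start in ((r, c) for r in range(h) for c in range(w)):
        if start in parent:
            continue
        # Random walk until the current tree is hit; keeping only the last
        # exit from each cell erases loops implicitly.
        last_exit = {}
        cell = start
        while cell not in parent:
            last_exit[cell] = rng.choice(list(neighbors(cell, h, w)))
            cell = last_exit[cell]
        # Attach the loop-erased path to the tree.
        cell = start
        while cell not in parent:
            parent[cell] = last_exit[cell]
            cell = last_exit[cell]
    return parent


def bfs_order(parent, root, rng):
    """Breadth-first traversal of the tree; sibling order is shuffled (an assumption)."""
    children = {cell: [] for cell in parent}
    for cell, p in parent.items():
        if p is not None:
            children[p].append(cell)
    order, queue = [], deque([root])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        kids = children[cell]
        rng.shuffle(kids)
        queue.extend(kids)
    return order


def sample_order_with_prefix(h, w, observed, rng, max_tries=10_000):
    """Rejection sampling: redraw trees until the observed patches form the prefix."""
    observed = set(observed)
    root = min(observed)  # arbitrary observed patch used as the root (assumption)
    for _ in range(max_tries):
        parent = sample_uniform_spanning_tree(h, w, root, rng)
        order = bfs_order(parent, root, rng)
        if set(order[: len(observed)]) == observed:
            return order
    raise RuntimeError("no accepted order within max_tries")
```

For example, `sample_order_with_prefix(4, 4, {(1, 1), (1, 2)}, random.Random(0))` would return a permutation of the 16 patch positions whose first two entries are the observed patches. This naive accept/reject loop can have a low acceptance rate for large observed regions; the abstract states that the construction is efficient, so the actual method presumably differs in its details.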