SAViR-T: Spatially Attentive Visual Reasoning with Transformers

We present a novel computational model, "SAViR-T", for the family of visual reasoning problems embodied in Raven's Progressive Matrices (RPM). Our model considers the explicit spatial semantics of visual elements within each image in the puzzle, encoded as spatio-visual tokens, and learns the intra-image as well as the inter-image token dependencies, both of which are highly relevant to the visual reasoning task. The transformer-based SAViR-T architecture models token-wise relationships to extract group (row or column) driven representations, leveraging group-rule coherence as the inductive bias for extracting the underlying rule representations in the top two rows (or columns), per token, of the RPM. We use these relation representations to locate the correct choice image that completes the last row or column of the RPM. Extensive experiments across synthetic RPM benchmarks, including RAVEN, I-RAVEN, RAVEN-FAIR, and PGM, and the natural image-based "V-PROM" demonstrate that SAViR-T sets a new state-of-the-art for visual reasoning, exceeding prior models' performance by a considerable margin.
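
The answer-selection idea described above (derive a rule representation from the first two rows, then pick the choice whose completed third row agrees with it best) can be illustrated with a minimal sketch. This is not the SAViR-T implementation: the function names, the fixed random projection standing in for the learned transformer, the averaging of the two row representations, and the cosine-similarity score are all assumptions made for illustration only.

```python
import numpy as np


def rule_representation(row_tokens, proj):
    """Toy stand-in (hypothetical) for a learned per-token rule extractor.

    row_tokens: array of shape (3, K, D), three images in a row, K tokens each.
    proj: array of shape (3 * D, R), a fixed projection replacing learned weights.
    Returns an array of shape (K, R): one rule vector per token position.
    """
    # Concatenate the three images' features at each token position: (K, 3 * D).
    per_token = np.concatenate(list(row_tokens), axis=-1)
    return np.tanh(per_token @ proj)


def cosine(a, b):
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def select_answer(context, choices, proj):
    """Pick the choice whose completed third row best matches the shared rule.

    context: array of shape (8, K, D), the eight given images in row-major order.
    choices: array of shape (C, K, D), the candidate answer images.
    Returns (index of the best choice, list of similarity scores).
    """
    r1 = rule_representation(context[0:3], proj)
    r2 = rule_representation(context[3:6], proj)
    # Assumption: the rule shared by the two complete rows is their average.
    shared = 0.5 * (r1 + r2)

    scores = []
    for choice in choices:
        third_row = np.stack([context[6], context[7], choice])
        scores.append(cosine(shared, rule_representation(third_row, proj)))
    return int(np.argmax(scores)), scores
```

In this toy setting, if all three rows follow an identical pattern, the correct choice reproduces the shared representation exactly and therefore receives the highest score; a trained model would replace `rule_representation` with learned attention layers over the spatio-visual tokens.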
