Recently, Transformers have shown promising performance on various vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, especially for high-resolution vision tasks. Local self-attention restricts the attention computation to a local region to improve efficiency, but the receptive field of a single attention layer is then not large enough, resulting in insufficient context modeling. When observing a scene, humans usually focus on a local region while attending to non-attentional regions at coarse granularity. Based on this observation, we develop the axially expanded window self-attention mechanism, which performs fine-grained self-attention within the local window and coarse-grained self-attention along the horizontal and vertical axes, and can thus effectively capture both short- and long-range visual dependencies.
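The abstract does not include an implementation; the following is a minimal single-head PyTorch sketch of the described pattern, in which each window's queries attend to the fine-grained tokens inside the window plus coarse tokens pooled from the windows along its horizontal and vertical axes. The class and parameter names, the average pooling used to form the coarse axial tokens, and the assumption that feature maps divide evenly into windows are illustrative choices, not the paper's actual design.

```python
import torch
import torch.nn as nn


class AxiallyExpandedAttention(nn.Module):
    """Sketch: queries attend to fine tokens in their window plus coarse
    tokens pooled from the windows sharing that window's row and column."""

    def __init__(self, dim, window_size=7):
        super().__init__()
        self.ws = window_size
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); H and W are assumed divisible by the window size.
        B, H, W, C = x.shape
        ws = self.ws
        nH, nW = H // ws, W // ws
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def windows(t):
            # (B, H, W, C) -> (B, nH, nW, ws*ws, C): non-overlapping windows.
            t = t.view(B, nH, ws, nW, ws, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(B, nH, nW, ws * ws, C)

        qw, kw, vw = windows(q), windows(k), windows(v)

        # Coarse tokens: average-pool each window to a single token
        # (the pooling choice is an assumption, not stated in the abstract).
        kc, vc = kw.mean(dim=3), vw.mean(dim=3)  # (B, nH, nW, C)

        def axial(t):
            # For window (i, j): the nW pooled windows of row i (horizontal
            # axis) and the nH pooled windows of column j (vertical axis).
            row = t.unsqueeze(2).expand(B, nH, nW, nW, C)
            col = t.permute(0, 2, 1, 3).unsqueeze(1).expand(B, nH, nW, nH, C)
            return torch.cat([row, col], dim=3)

        # Keys/values per window: ws*ws fine tokens + nW + nH coarse tokens.
        keys = torch.cat([kw, axial(kc)], dim=3)
        values = torch.cat([vw, axial(vc)], dim=3)

        attn = (qw * self.scale) @ keys.transpose(-2, -1)
        out = attn.softmax(dim=-1) @ values      # (B, nH, nW, ws*ws, C)

        # Merge windows back to (B, H, W, C) and project.
        out = out.view(B, nH, nW, ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        return self.proj(out.reshape(B, H, W, C))


x = torch.randn(2, 28, 28, 64)                  # toy 28x28 feature map
print(AxiallyExpandedAttention(64)(x).shape)    # torch.Size([2, 28, 28, 64])
```

Under these assumptions, each query sees ws*ws + nW + nH keys instead of the H*W keys of global attention, which is how the fine local plus coarse axial split keeps the cost well below full self-attention while still covering the whole row and column.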