A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for Inferring Bird's-Eye-View Semantic Segmentation

Because bird's-eye-view (BEV) semantic segmentation is simple to visualize and easy to handle, it has been applied in autonomous driving to provide surrounding information to downstream tasks. Inferring BEV semantic segmentation conditioned on multi-camera-view images is a popular scheme in the community because it relies on cheap devices and supports real-time processing. Recent work implemented this task by learning content and position relationships via the vision Transformer (ViT). However, the quadratic complexity of ViT confines relationship learning to the latent layer, leaving a scale gap that impedes the representation of fine-grained objects. Moreover, its plain fusion of multi-view features does not conform to the information absorption intention in representing BEV features. To tackle these issues, we propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for inferring semantic segmentation. Specifically, we devise a hierarchical framework to refine the BEV feature representation, in which the last BEV feature map is only half the size of the final segmentation. To offset the computation increase caused by this hierarchical framework, we exploit a cross-scale Transformer to learn feature relationships in a reversed-aligning manner, and leverage a residual connection of BEV features to facilitate information transmission between scales. We also propose correspondence-augmented attention to distinguish conducive from non-conducive correspondences. It is implemented in a simple yet effective way: attention scores are amplified before the Softmax operation, so that position-view-related attention scores are highlighted and position-view-unrelated ones are suppressed. Extensive experiments demonstrate that our method achieves state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
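
The idea of amplifying attention scores before the Softmax can be illustrated with a small sketch. This is not the paper's implementation: it assumes the amplification is a plain scalar multiplication by a factor `gamma` greater than 1, and the function names, the value of `gamma`, and the example scores are all hypothetical. Under that assumption, scaling widens the gaps between raw scores, so the strongest correspondence gains weight and the weakest ones lose weight after normalization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def amplified_attention(scores, gamma=2.0):
    """Scale raw attention scores by gamma (> 1) before the softmax.

    Assumption: amplification is a scalar multiplication. The abstract
    only states that scores are amplified before the Softmax operation.
    """
    return softmax([gamma * s for s in scores])


# Hypothetical scores of one BEV query against four image-feature keys.
# Positive values stand in for related position/view pairs, negative
# values for unrelated ones.
raw_scores = [1.2, 0.3, -0.8, -1.5]

plain = softmax(raw_scores)
amplified = amplified_attention(raw_scores, gamma=2.0)

for s, p, a in zip(raw_scores, plain, amplified):
    print(f"score={s:+.1f}  plain={p:.3f}  amplified={a:.3f}")
```

With these inputs the weight on the highest-scoring key increases and the weights on the negative-scoring keys shrink relative to the plain softmax, which is the highlight-and-suppress effect described above. How the actual method chooses or learns the amplification is not specified in the abstract.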
