Wide-angle lenses are commonly used in perception tasks requiring a large field of view. Unfortunately, these lenses produce significant distortion, preventing conventional models, which ignore distortion effects, from adapting to wide-angle images. In this paper, we present a novel transformer-based model that automatically adapts to the distortion produced by wide-angle lenses. We leverage the physical characteristics of such lenses, which are analytically defined by the radial distortion profile (assumed to be known), to develop a distortion-aware radial Swin transformer (DarSwin). In contrast to conventional transformer-based architectures, DarSwin comprises a radial patch partitioning, a distortion-based sampling technique for creating token embeddings, and a polar position encoding for radial patch merging. We validate our method on classification tasks using synthetically distorted ImageNet data and show through extensive experiments that DarSwin performs zero-shot adaptation to unseen distortions of different wide-angle lenses. Compared to other baselines, DarSwin achieves the best results (in terms of Top-1 and Top-5 accuracy) when tested on in-distribution data, with an almost 2% (6%) gain in Top-1 accuracy under medium (high) distortion levels, and results comparable to the state of the art under low and very low distortion levels (perspective-like images).
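The core idea of distortion-aware radial partitioning can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `equidistant_profile` lens model and the uniform-in-incident-angle sampling scheme are assumptions chosen to show how a known radial distortion profile can drive where patch samples land on the distorted image plane.

```python
import numpy as np

def equidistant_profile(theta, f=1.0):
    # Assumed example lens model: equidistant fisheye, where the distorted
    # radius grows linearly with the incident angle theta.
    return f * theta

def radial_patch_centers(n_radial, n_azimuth,
                         theta_max=np.pi / 2, profile=equidistant_profile):
    """Sample patch-center locations on the distorted image plane.

    Centers are laid out uniformly in incident angle (theta) and azimuth
    (phi); the lens profile maps each theta to a distorted radius, so the
    spacing of samples on the image follows the lens distortion instead of
    a fixed square grid.
    """
    # Midpoints of n_radial uniform bins in incident angle.
    thetas = (np.arange(n_radial) + 0.5) * theta_max / n_radial
    # Midpoints of n_azimuth uniform bins in azimuth.
    phis = (np.arange(n_azimuth) + 0.5) * 2.0 * np.pi / n_azimuth
    # Distorted radii given by the (known) radial distortion profile.
    r = profile(thetas)                       # shape (n_radial,)
    x = r[:, None] * np.cos(phis)[None, :]    # shape (n_radial, n_azimuth)
    y = r[:, None] * np.sin(phis)[None, :]
    return np.stack([x, y], axis=-1)          # (n_radial, n_azimuth, 2)

centers = radial_patch_centers(n_radial=4, n_azimuth=8)
print(centers.shape)  # (4, 8, 2)
```

Because the sampling is parameterized by the distortion profile rather than by pixel coordinates, swapping in a different lens profile changes where the patches fall without changing the model's token layout, which is the intuition behind zero-shot adaptation to unseen lenses.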