Wide-angle lenses are commonly used in perception tasks requiring a large field of view. Unfortunately, these lenses produce significant distortion, preventing conventional models, which ignore distortion effects, from adapting to wide-angle images. In this paper, we present a novel transformer-based model that automatically adapts to the distortion produced by wide-angle lenses. We leverage the physical characteristics of such lenses, which are analytically defined by the radial distortion profile (assumed to be known), to develop a distortion-aware radial Swin transformer (DarSwin). In contrast to conventional transformer-based architectures, DarSwin comprises a radial patch partitioning, a distortion-based sampling technique for creating token embeddings, and a polar position encoding for radial patch merging. We validate our method on classification tasks using synthetically distorted ImageNet data and show through extensive experiments that DarSwin performs zero-shot adaptation to unseen distortions of different wide-angle lenses. Compared to other baselines, DarSwin achieves the best results (in terms of Top-1 and Top-5 accuracy) when tested on in-distribution data, with an almost 2% (6%) gain in Top-1 accuracy under medium (high) distortion levels, and results comparable to the state of the art under low and very low distortion levels (perspective-like images).
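The core idea of distortion-aware radial partitioning can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `equidistant_profile` lens model and the uniform-in-incident-angle sampling scheme are assumptions chosen to show how a known radial distortion profile can drive where patch samples land on the distorted image plane.

```python
import numpy as np

def equidistant_profile(theta, f=1.0):
    # Assumed example lens model: equidistant fisheye, where the distorted
    # radius grows linearly with the incident angle theta.
    return f * theta

def radial_patch_centers(n_radial, n_azimuth,
                         theta_max=np.pi / 2, profile=equidistant_profile):
    """Sample patch-center locations on the distorted image plane.

    Centers are laid out uniformly in incident angle (theta) and azimuth
    (phi); the lens profile maps each theta to a distorted radius, so the
    spacing of samples on the image follows the lens distortion instead of
    a fixed square grid.
    """
    # Midpoints of n_radial uniform bins in incident angle.
    thetas = (np.arange(n_radial) + 0.5) * theta_max / n_radial
    # Midpoints of n_azimuth uniform bins in azimuth.
    phis = (np.arange(n_azimuth) + 0.5) * 2.0 * np.pi / n_azimuth
    # Distorted radii given by the (known) radial distortion profile.
    r = profile(thetas)                       # shape (n_radial,)
    x = r[:, None] * np.cos(phis)[None, :]    # shape (n_radial, n_azimuth)
    y = r[:, None] * np.sin(phis)[None, :]
    return np.stack([x, y], axis=-1)          # (n_radial, n_azimuth, 2)

centers = radial_patch_centers(n_radial=4, n_azimuth=8)
print(centers.shape)  # (4, 8, 2)
```

Because the sampling is parameterized by the distortion profile rather than by pixel coordinates, swapping in a different lens profile changes where the patches fall without changing the model's token layout, which is the intuition behind zero-shot adaptation to unseen lenses.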