In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it on ImageNet classification, COCO detection, and Kinetics video recognition, where it outperforms prior work. We further compare MViT's pooling attention to window attention mechanisms, and find that it outperforms the latter in accuracy/compute trade-offs. Without bells and whistles, MViT achieves state-of-the-art performance in three domains: 88.8% accuracy on ImageNet classification, 56.1 box AP on COCO object detection, and 86.1% on Kinetics-400 video classification. Code and models will be made publicly available.
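To make the two architectural changes concrete, the following is a minimal single-head PyTorch sketch of pooling attention with a residual pooling connection and decomposed relative positional embeddings. It assumes a square token grid, a single shared average-pooling stride for Q, K, and V, and illustrative names (`PoolingAttention`, `rel_h`, `rel_w`); it is a sketch of the idea under those assumptions, not the authors' exact implementation, which uses multi-head attention, convolutional pooling, and separate Q and K/V strides.

```python
import torch
import torch.nn as nn


class PoolingAttention(nn.Module):
    """Single-head pooling attention over a square H x W token grid (sketch)."""

    def __init__(self, dim: int, size: int, stride: int = 2):
        super().__init__()
        self.size, self.stride = size, stride
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Pooling shrinks the token grid before attention (here: average
        # pooling with one shared stride for Q, K, and V, for simplicity).
        self.pool = nn.AvgPool2d(stride, stride)
        out = size // stride
        # Decomposed relative positional embeddings: one learnable table per
        # axis (2*out - 1 possible offsets each) instead of a joint 2-D table.
        self.rel_h = nn.Parameter(torch.zeros(2 * out - 1, dim))
        self.rel_w = nn.Parameter(torch.zeros(2 * out - 1, dim))

    def _pool_tokens(self, t: torch.Tensor) -> torch.Tensor:
        # (B, H*W, C) -> (B, h*w, C) with h = H // stride, w = W // stride.
        B, _, C = t.shape
        t = t.transpose(1, 2).reshape(B, C, self.size, self.size)
        return self.pool(t).flatten(2).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, _, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = map(self._pool_tokens, (q, k, v))
        h = w = self.size // self.stride

        attn = (q @ k.transpose(-2, -1)) * C ** -0.5       # (B, h*w, h*w)

        # Decomposed relative bias: score(i, j) += q_i . R_h[dy] + q_i . R_w[dx].
        idx = torch.arange(h, device=x.device)
        rel = idx[:, None] - idx[None, :] + (h - 1)        # offsets -> table rows
        qg = q.view(B, h, w, C)
        bias_h = torch.einsum("byxc,yzc->byxz", qg, self.rel_h[rel])  # key rows
        bias_w = torch.einsum("byxc,xzc->byxz", qg, self.rel_w[rel])  # key cols
        attn = attn.view(B, h, w, h, w)
        attn = attn + bias_h[..., :, None] + bias_w[..., None, :]
        attn = attn.view(B, h * w, h * w).softmax(dim=-1)

        # Residual pooling connection: add the pooled queries to the output.
        out = attn @ v + q
        return self.proj(out)


x = torch.randn(2, 16 * 16, 96)                  # 16x16 grid of 96-d tokens
y = PoolingAttention(dim=96, size=16, stride=2)(x)
print(y.shape)                                   # torch.Size([2, 64, 96])
```

In the full model, Q and K/V are pooled with separate (convolutional) strides, so Q pooling can downsample the feature map between stages while K/V pooling only reduces attention cost, and the relative-position tables are sized to handle mismatched query/key resolutions.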