Vision Transformers have shown strong performance on individual tasks such as image classification and segmentation. Real-world problems, however, are rarely isolated, which calls for vision transformers that can perform multiple tasks concurrently. Existing multi-task vision transformers are handcrafted and rely heavily on human expertise. In this work, we propose a novel one-shot neural architecture search framework, dubbed AutoTaskFormer (Automated Multi-Task Vision TransFormer), to automate this process. AutoTaskFormer not only automatically identifies the weights to share across multiple tasks, but also provides thousands of well-trained vision transformers spanning a wide range of architectural parameters (e.g., number of attention heads and network depth) for deployment under various resource constraints. Experiments on both small-scale (2-task Cityscapes and 3-task NYUv2) and large-scale (16-task Taskonomy) datasets show that AutoTaskFormer outperforms state-of-the-art handcrafted vision transformers in multi-task learning. The code and models will be open-sourced.
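To make the one-shot search-space idea concrete, below is a minimal PyTorch-style sketch of a multi-task supernet from which subnetworks with different depths and head counts can be sampled, with per-task output heads on top of shared transformer blocks. All names here (`MultiTaskSupernet`, `sample_subnet`, the task list) are hypothetical illustrations under assumed design choices, not AutoTaskFormer's actual implementation.

```python
# A minimal sketch of one-shot multi-task supernet sampling (hypothetical
# names and design choices; not AutoTaskFormer's released code).
import random
import torch
import torch.nn as nn

class Block(nn.Module):
    """Transformer encoder block whose attention supports a variable number
    of heads at forward time; all head counts reuse the same projections."""
    def __init__(self, dim, max_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, num_heads):
        B, N, D = x.shape
        head_dim = D // num_heads  # num_heads must divide the embedding dim
        qkv = self.qkv(self.norm1(x)).reshape(B, N, 3, num_heads, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, D)
        x = x + self.proj(out)
        return x + self.mlp(self.norm2(x))

class MultiTaskSupernet(nn.Module):
    """Supernet covering variable depth and head counts; each task owns a
    lightweight head while the sampled blocks are shared across tasks."""
    def __init__(self, dim=192, max_depth=12, max_heads=8, tasks=("seg", "depth")):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim, max_heads) for _ in range(max_depth)])
        self.task_heads = nn.ModuleDict({t: nn.Linear(dim, dim) for t in tasks})

    def sample_subnet(self):
        # Random depth, plus a head count that divides dim=192.
        return {"depth": random.randint(1, len(self.blocks)),
                "heads": random.choice([2, 4, 8])}

    def forward(self, x, task, subnet):
        for blk in self.blocks[: subnet["depth"]]:
            x = blk(x, subnet["heads"])
        return self.task_heads[task](x)

net = MultiTaskSupernet()
tokens = torch.randn(2, 16, 192)           # (batch, tokens, embed dim)
cfg = net.sample_subnet()                  # e.g. {'depth': 7, 'heads': 4}
out = net(tokens, task="seg", subnet=cfg)  # one sampled subnet, one task
```

In a one-shot setting, training would resample a subnet (and typically a task) at every step, so all candidate architectures jointly train the shared supernet weights; the snippet deliberately omits the actual search procedure and any multi-task loss weighting.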