In this paper, we propose a single UniFied transfOrmer (UFO) that can process either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of an image and a question) for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes the image-text contrastive loss, the image-text matching loss, and the masked language modeling loss based on bidirectional and seq2seq attention masks. The same transformer network serves as the image encoder, the text encoder, or the fusion network in different pre-training tasks. Empirically, we observe less conflict among the different tasks and achieve new state-of-the-art results on visual question answering, COCO image captioning (cross-entropy optimization), and nocaps (in SPICE). On other downstream tasks, e.g., image-text retrieval, we also achieve competitive performance.
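To make the multi-task objective concrete, the following is a minimal NumPy sketch of one of the three pre-training losses, the image-text contrastive (InfoNCE-style) term, where matched image-text pairs sit on the diagonal of a similarity matrix. The function names, the temperature value, and the toy embeddings are illustrative assumptions, not the paper's actual implementation; in UFO the `img_emb` and `txt_emb` would both come from the same shared transformer applied to each modality separately.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch.

    img_emb, txt_emb: (B, D) arrays; the i-th image matches the i-th text.
    Both directions (image-to-text and text-to-image) are averaged.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    idx = np.arange(logits.shape[0])             # positives on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits.T, axis=1)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy check: perfectly aligned embeddings give a near-zero loss.
aligned = itc_loss(np.eye(4), np.eye(4))
```

In the full pre-training setup described above, this term would be summed with the image-text matching and masked language modeling losses, with the same transformer weights shared across all three tasks.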