Contrastive Vision-Language Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts and thus conduct zero-shot recognition in open-vocabulary scenarios. However, there exists a semantic gap between specific downstream applications and the generally pre-trained knowledge, which makes the matching sub-optimal on downstream tasks. In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts. Specifically, we guide the text features to adaptively explore informative regions on the image and aggregate the visual features via a cross-attention mechanism. In this way, the visual-guided texts become more semantically correlated with the image, which greatly benefits the matching process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets and conduct extensive ablation studies to demonstrate the effectiveness of VT-CLIP. The code will be released soon.
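To make the cross-attention idea concrete, below is a minimal sketch of how visual-guided texts could be implemented: CLIP's per-class text embeddings act as queries over the spatial visual tokens of the image encoder, and the attended output refines the text features before matching. The module name, feature dimensions, and the residual/normalization choices are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch (assumed, not the official VT-CLIP implementation):
# text features query the image's spatial tokens via cross-attention,
# and the attended visual context refines the text features.
import torch
import torch.nn as nn


class VisualGuidedText(nn.Module):  # hypothetical module name
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # query: text features, key/value: spatial visual features
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feat: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_feat:     (B, num_classes, dim)  -- CLIP text embeddings per class
        # visual_tokens: (B, num_patches, dim)  -- spatial features from CLIP's image encoder
        attended, _ = self.cross_attn(query=text_feat, key=visual_tokens, value=visual_tokens)
        # residual update keeps the guided texts close to the pre-trained embeddings
        return self.norm(text_feat + attended)


if __name__ == "__main__":
    # classify by cosine similarity between the global image feature
    # and the visual-guided text features (shapes are illustrative)
    B, C, P, D = 2, 11, 49, 512
    guide = VisualGuidedText(dim=D)
    guided_text = guide(torch.randn(B, C, D), torch.randn(B, P, D))  # (B, C, D)
    image_feat = torch.randn(B, D)
    logits = torch.einsum(
        "bd,bcd->bc",
        nn.functional.normalize(image_feat, dim=-1),
        nn.functional.normalize(guided_text, dim=-1),
    )
    print(logits.shape)  # torch.Size([2, 11])
```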