CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval

Vision-language retrieval aims to perform cross-modal instance search, and its core idea is to learn consistent vision-language representations. Although cross-modal retrieval performance has improved greatly with the development of deep models, we find that traditional hard consistency may destroy the original relationships among single-modal instances, leading to performance degradation in single-modal retrieval. In this paper, we experimentally observe that vision-language divergence may give rise to strong and weak modalities, and that hard cross-modal consistency cannot guarantee that the relationships among strong-modality instances are unaffected by the weak modality. As a result, those relationships are perturbed even though the learned representations are consistent. To this end, we propose a novel and direct Coordinated Vision-Language Retrieval method (dubbed CoVLR), which aims to study and alleviate the desynchrony between the cross-modal alignment and single-modal cluster-preserving tasks. CoVLR develops an effective meta-optimization based strategy in which the cross-modal consistency objective and the intra-modal relation-preserving objective act as the meta-train and meta-test tasks, respectively, so that both tasks are optimized in a coordinated way. Consequently, we can simultaneously ensure cross-modal consistency and intra-modal structure. Experiments on different datasets validate that CoVLR improves single-modal retrieval accuracy while preserving cross-modal retrieval capacity compared with the baselines.
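The abstract does not spell out the update rule, so the following is only a minimal sketch of one generic way to couple a meta-train and a meta-test objective (in the style of gradient-based meta-learning), not the authors' implementation. Two toy quadratic losses stand in for the cross-modal consistency and intra-modal relation-preserving objectives. The constants `A`, `C`, `B`, `D`, the step sizes `alpha`, `beta`, `lr`, and all function names are assumptions introduced for illustration.

```python
# Toy sketch (assumption): diagonal quadratic losses replace the real objectives.
A, C = [1.0, 0.5], [1.0, 0.0]  # curvature / optimum of the stand-in cross-modal loss
B, D = [0.5, 1.0], [0.0, 1.0]  # curvature / optimum of the stand-in intra-modal loss


def loss(theta, curv, target):
    """0.5 * sum_i curv_i * (theta_i - target_i)^2."""
    return 0.5 * sum(k * (t - m) ** 2 for k, t, m in zip(curv, theta, target))


def grad(theta, curv, target):
    """Analytic gradient of the quadratic loss above."""
    return [k * (t - m) for k, t, m in zip(curv, theta, target)]


def virtual_step(theta, alpha):
    """Meta-train: one virtual gradient step on the cross-modal loss."""
    return [t - alpha * g for t, g in zip(theta, grad(theta, A, C))]


def total_loss(theta, alpha=0.1, beta=1.0):
    """Cross-modal loss at theta plus intra-modal loss at the virtually updated theta'."""
    return loss(theta, A, C) + beta * loss(virtual_step(theta, alpha), B, D)


def coordinated_step(theta, alpha=0.1, beta=1.0, lr=0.1):
    """One gradient step on total_loss, differentiating through the virtual step."""
    g_cross = grad(theta, A, C)
    g_intra = grad(virtual_step(theta, alpha), B, D)  # meta-test gradient at theta'
    # Chain rule: d(theta')/d(theta) = 1 - alpha * A_i, since the Hessian is diagonal here.
    meta_grad = [gc + beta * (1.0 - alpha * a) * gi
                 for gc, a, gi in zip(g_cross, A, g_intra)]
    return [t - lr * g for t, g in zip(theta, meta_grad)]


theta_init = [3.0, -2.0]
theta = list(theta_init)
for _ in range(200):
    theta = coordinated_step(theta)
```

The point of the sketch is the second term of `meta_grad`: the intra-modal gradient is evaluated after the virtual cross-modal step and propagated back through it, so a parameter update is favored only when it helps both objectives together rather than one at the expense of the other.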
