Cross-Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling

Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, and it is often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches. First, they lack robust intra-modal mechanisms for assessing the significance of visual and textual tokens, which leads to poor generalization in complex scenes. Second, they lack fine-grained uncertainty modeling and therefore fail to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that combines significance-aware and granularity-aware modeling with region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures and significantly enhances the robustness and interpretability of fine-grained image-text alignment.

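To make the idea of representing a region as a mixture of Gaussians more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a diagonal-covariance mixture per region and scores a region against a word embedding with a Monte Carlo estimate of the expected cosine similarity. The function names, the toy dimensions, and the choice of cosine similarity are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np

def sample_region_mixture(weights, means, log_vars, n_samples, rng):
    """Draw samples from a diagonal-covariance Gaussian mixture for one region.

    weights:  (K,)   mixture weights, summing to 1
    means:    (K, D) component means
    log_vars: (K, D) per-dimension log-variances
    """
    # Pick a mixture component for every sample, then add scaled Gaussian noise.
    component = rng.choice(len(weights), size=n_samples, p=weights)
    noise = rng.standard_normal((n_samples, means.shape[1]))
    return means[component] + np.exp(0.5 * log_vars[component]) * noise

def expected_cosine(samples, word):
    """Monte Carlo estimate of the expected cosine similarity to a word embedding."""
    unit_samples = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    unit_word = word / np.linalg.norm(word)
    return float(np.mean(unit_samples @ unit_word))

# Toy setup (assumed values): 3 components in an 8-dimensional embedding space,
# with mutually orthogonal means and a small variance.
rng = np.random.default_rng(0)
dim, n_components = 8, 3
weights = np.array([0.5, 0.3, 0.2])
means = 3.0 * np.eye(n_components, dim)
log_vars = np.full((n_components, dim), -4.0)

samples = sample_region_mixture(weights, means, log_vars, 2000, rng)

# A word aligned with the dominant component scores higher than its opposite.
score_near = expected_cosine(samples, means[0])
score_far = expected_cosine(samples, -means[0])
print(score_near, score_far)
```

Because the region is a distribution rather than a single point, several different words can each match one of its components, which is one way to picture the one-to-many correspondences the abstract describes.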