This paper introduces our solution, XM-ALIGN (Unified Cross-Modal Embedding Alignment Framework), proposed for the FAME challenge at ICASSP 2026. Our framework combines explicit and implicit alignment mechanisms, significantly improving cross-modal verification performance on both "heard" and "unheard" languages. We extract feature embeddings from face and voice encoders, jointly optimize them with a shared classifier, and employ a mean squared error (MSE) embedding alignment loss to ensure tight alignment between the two modalities. In addition, data augmentation strategies are applied during training to enhance generalization. Experimental results show that our approach achieves superior performance on the MAV-Celeb dataset. The code will be released at https://github.com/PunkMale/XM-ALIGN.
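To make the training objective concrete, the following is a minimal PyTorch sketch of how a shared classifier and an MSE embedding-alignment loss might be combined. The module name, embedding dimension, number of identities, and the weighting term `lambda_align` are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedClassifierAlignment(nn.Module):
    """Joint objective: identity classification through a classifier shared
    by both modalities, plus MSE alignment between paired face and voice
    embeddings. (Illustrative sketch; dimensions and weights are assumptions.)"""

    def __init__(self, embed_dim=512, num_identities=1000, lambda_align=1.0):
        super().__init__()
        # One classifier head shared by both modalities (implicit alignment).
        self.classifier = nn.Linear(embed_dim, num_identities)
        self.lambda_align = lambda_align

    def forward(self, face_emb, voice_emb, labels):
        # Cross-entropy on each modality through the shared classifier.
        loss_face = F.cross_entropy(self.classifier(face_emb), labels)
        loss_voice = F.cross_entropy(self.classifier(voice_emb), labels)
        # Explicit alignment: pull paired embeddings together with MSE.
        loss_align = F.mse_loss(face_emb, voice_emb)
        return loss_face + loss_voice + self.lambda_align * loss_align

# Usage with random tensors standing in for encoder outputs.
face_emb = torch.randn(8, 512)   # embeddings from a face encoder
voice_emb = torch.randn(8, 512)  # embeddings from a voice encoder
labels = torch.randint(0, 1000, (8,))
criterion = SharedClassifierAlignment()
loss = criterion(face_emb, voice_emb, labels)
```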