Cross-modal retrieval uses one modality as a query to retrieve data from another modality, and has become a popular topic in information retrieval, machine learning, and databases. The major challenge of cross-modal retrieval is how to effectively measure the similarity between data from different modalities. Although several research works have computed the correlation between different modalities by learning a common subspace representation, the encoder's ability to extract features from multi-modal information remains unsatisfactory. In this paper, we present a novel variational autoencoder (VAE) architecture for audio-visual cross-modal retrieval, which learns paired audio-visual correlation embeddings and category correlation embeddings as constraints to reinforce the mutuality of audio-visual information. On the one hand, an audio encoder and a visual encoder separately encode audio data and visual data into two different latent spaces, and two mutual latent spaces are then constructed by canonical correlation analysis (CCA). On the other hand, probabilistic modeling is used to handle possible noise and missing information in the data. In this way, the cross-modal discrepancy arising from both intra-modal and inter-modal information is eliminated simultaneously in the joint embedding subspace. We conduct extensive experiments on two benchmark datasets. The experimental results show that the proposed architecture is effective in learning audio-visual correlation and noticeably outperforms existing cross-modal retrieval methods.
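To make the described architecture concrete, the following is a minimal PyTorch sketch of the dual-encoder idea: each modality is encoded into a Gaussian latent space via the reparameterization trick, and a simple per-dimension correlation objective stands in for the full CCA constraint. All module names, layer sizes, and the simplified loss are illustrative assumptions, not the paper's exact design.

    # Minimal sketch of the dual-encoder VAE idea (assumed names/sizes).
    import torch
    import torch.nn as nn

    class ModalityEncoder(nn.Module):
        """Encodes one modality into a Gaussian latent (mu, logvar)."""
        def __init__(self, in_dim, latent_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
            self.mu = nn.Linear(512, latent_dim)
            self.logvar = nn.Linear(512, latent_dim)

        def forward(self, x):
            h = self.net(x)
            return self.mu(h), self.logvar(h)

    def reparameterize(mu, logvar):
        # Sample z ~ N(mu, sigma^2) with the reparameterization trick.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def correlation_loss(z_a, z_v, eps=1e-8):
        # Simplified surrogate for the CCA constraint: maximize the mean
        # per-dimension Pearson correlation of paired audio/visual codes.
        za = z_a - z_a.mean(dim=0)
        zv = z_v - z_v.mean(dim=0)
        corr = (za * zv).sum(dim=0) / (za.norm(dim=0) * zv.norm(dim=0) + eps)
        return -corr.mean()  # negate: minimizing this maximizes correlation

    # Hypothetical usage with random stand-in features.
    audio_enc = ModalityEncoder(in_dim=128, latent_dim=10)
    visual_enc = ModalityEncoder(in_dim=1024, latent_dim=10)
    audio, visual = torch.randn(32, 128), torch.randn(32, 1024)
    mu_a, lv_a = audio_enc(audio)
    mu_v, lv_v = visual_enc(visual)
    z_a = reparameterize(mu_a, lv_a)
    z_v = reparameterize(mu_v, lv_v)
    loss = correlation_loss(z_a, z_v)

In a full model, this correlation term would be combined with the usual VAE reconstruction and KL terms and with the category correlation constraint described above.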