We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track relatedness. By applying Latent Semantic Indexing (LSI), we embed the corresponding textual information into a latent vector space from which we derive track relatedness for online triplet selection. This LSI topic modelling facilitates fine-grained selection of similar and dissimilar audio-track pairs to learn the audio representation using a Convolutional Recurrent Neural Network (CRNN). In this way, we directly project the semantic context of the unstructured text modality onto the learned representation space of the audio modality without deriving structured ground-truth annotations from it. We evaluate our approach on the Europeana Sounds collection and show how to improve search in digital audio libraries by harnessing the multilingual metadata provided by numerous European digital libraries. We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection. The learned representations perform comparably to the baseline of handcrafted features, and even exceed this baseline in similarity-retrieval precision at higher cut-offs, with only 15\% of the baseline's feature-vector length.
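The core idea sketched in the abstract — embedding track metadata with LSI and using latent-space similarity to pick positive and negative examples for triplet training — can be illustrated with a minimal, self-contained sketch. This is an assumption-laden toy (invented metadata strings, truncated SVD in place of a full LSI pipeline, a simple max/min selection rule), not the paper's actual implementation:

```python
import numpy as np

# Hypothetical toy metadata: one short text per audio track
# (an assumption for illustration, not the Europeana Sounds data).
docs = [
    "folk song fiddle",
    "fiddle folk dance",
    "radio news speech",
    "speech interview radio",
]

# Build a simple term-count document-term matrix.
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# LSI: truncated SVD of the document-term matrix, keeping k latent topics.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = U[:, :k] * s[:k]  # latent topic coordinates, one row per track


def cosine(a, b):
    """Cosine similarity between two latent vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_triplet(anchor):
    """Online triplet selection: for an anchor track, take the most
    similar other track as positive and the least similar as negative."""
    sims = [(cosine(Z[anchor], Z[j]), j)
            for j in range(len(docs)) if j != anchor]
    positive = max(sims)[1]
    negative = min(sims)[1]
    return anchor, positive, negative


print(select_triplet(0))
```

The resulting (anchor, positive, negative) index triples would then feed a triplet loss over the audio encoder (the CRNN in the paper), so that the text-derived relatedness shapes the audio embedding space without any structured labels.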