In the context of music information retrieval, similarity-based approaches are useful for a variety of tasks that benefit from a query-by-example scenario. Music, however, naturally decomposes into a set of semantically meaningful factors of variation. Current representation learning strategies pursue the disentanglement of such factors from deep representations, resulting in highly interpretable models. This allows the modeling of music similarity perception, which is highly subjective and multi-dimensional. While prior work focuses on metadata-driven notions of similarity, we suggest directly modeling the human notion of multi-dimensional music similarity. To this end, we propose a multi-input deep neural network architecture that simultaneously processes mel-spectrogram, CENS chromagram, and tempogram inputs in order to extract informative features for six disentangled musical dimensions: genre, mood, instrument, era, tempo, and key. We evaluated the proposed music similarity approach using a triplet prediction task and found that the multi-input architecture outperforms a state-of-the-art method. Furthermore, we present a novel multi-dimensional analysis to evaluate the influence of each disentangled dimension on the perception of music similarity.
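To make the described setup concrete, the following is a minimal sketch of a multi-input network with one disentangled sub-embedding per musical dimension, trained with a triplet loss. It assumes PyTorch; the branch architecture, layer sizes, input shapes, and the per-dimension embedding split are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' implementation): three input branches
# (mel-spectrogram, CENS chromagram, tempogram) fused into per-dimension
# sub-embeddings, trained with a triplet loss. All sizes are assumptions.
import torch
import torch.nn as nn

DIMENSIONS = ["genre", "mood", "instrument", "era", "tempo", "key"]

class Branch(nn.Module):
    """Small CNN encoder for one time-frequency input representation."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),   # makes the branch shape-agnostic
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class MultiInputSimilarityNet(nn.Module):
    """Fuses the three branches, then projects into one sub-embedding per
    disentangled musical dimension; the concatenation forms the overall
    similarity space."""
    def __init__(self, sub_dim=16):
        super().__init__()
        self.mel, self.cens, self.tempo = Branch(), Branch(), Branch()
        self.heads = nn.ModuleDict(
            {d: nn.Linear(3 * 64, sub_dim) for d in DIMENSIONS}
        )

    def forward(self, mel, cens, tempogram):
        h = torch.cat(
            [self.mel(mel), self.cens(cens), self.tempo(tempogram)], dim=-1
        )
        return torch.cat([self.heads[d](h) for d in DIMENSIONS], dim=-1)

# Triplet prediction setup: the anchor should lie closer to the positive
# than to the negative in the learned embedding space.
model = MultiInputSimilarityNet()
loss_fn = nn.TripletMarginLoss(margin=0.2)

mel = torch.randn(8, 1, 128, 96)    # batch of 8; shapes are illustrative
cens = torch.randn(8, 1, 12, 96)    # 12 chroma bins
tgram = torch.randn(8, 1, 64, 96)   # 64 tempo bins

anchor = model(mel, cens, tgram)
positive = model(mel + 0.01 * torch.randn_like(mel), cens, tgram)
negative = model(torch.randn_like(mel), torch.randn_like(cens),
                 torch.randn_like(tgram))
loss = loss_fn(anchor, positive, negative)
```

Because each dimension occupies its own slice of the final embedding, distances can also be computed per slice, which is what makes a per-dimension analysis of similarity perception possible.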