Convolutional Neural Networks have been extensively explored for the task of automatic music tagging. The problem can be approached using either engineered time-frequency features or raw audio as input. Modulation filter bank representations, which have been actively researched as a basis for timbre perception, have the potential to facilitate the extraction of perceptually salient features. We explore end-to-end learned front-ends for audio representation learning, ModNet and SincModNet, that incorporate a temporal modulation processing block. The structure is effectively analogous to a modulation filter bank, where the FIR filter center frequencies are learned in a data-driven manner. The expectation is that a perceptually motivated filter bank can provide a useful representation for identifying music features. Our experimental results provide a fully visualisable and interpretable front-end temporal modulation decomposition of raw audio. We evaluate the performance of our model against the state of the art in music tagging on the MagnaTagATune dataset. We analyse the impact on per-tag performance when time-frequency bands are subsampled by the modulation filters at progressively reduced rates. We demonstrate that modulation filtering provides promising results for music tagging and feature representation, without relying on extensive musical domain knowledge in the design of this front-end.
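The kind of sinc-parameterised FIR front-end described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it only shows the standard parameterisation such front-ends use, in which each band-pass filter is the difference of two windowed sinc low-pass filters, and the cutoff frequencies (fixed constants here) are the quantities that would be learned from data.

```python
import numpy as np

def sinc_bandpass_filters(low_hz, high_hz, kernel_size=101, sr=16000):
    """Build FIR band-pass filters as differences of two sinc low-pass
    filters. In a SincModNet-style front-end, low_hz/high_hz would be
    learnable parameters; here they are fixed for illustration."""
    # Symmetric time axis in seconds, centred on the filter midpoint.
    t = (np.arange(kernel_size) - kernel_size // 2) / sr
    window = np.hamming(kernel_size)  # smooth truncation of the ideal sinc
    filters = []
    for lo, hi in zip(low_hz, high_hz):
        # Ideal band-pass = low-pass at hi minus low-pass at lo.
        band = 2 * hi * np.sinc(2 * hi * t) - 2 * lo * np.sinc(2 * lo * t)
        filters.append(band * window)
    return np.stack(filters)

# Two example bands; a real front-end would learn many such bands.
filters = sinc_bandpass_filters([100.0, 500.0], [400.0, 2000.0])
audio = np.random.randn(16000)  # one second of dummy raw audio
# Convolving raw audio with each filter yields a band-wise decomposition.
out = np.stack([np.convolve(audio, f, mode="same") for f in filters])
```

Because the filters are ordinary FIR kernels defined by a handful of frequency parameters, the learned decomposition stays directly visualisable: plotting each kernel's magnitude response reveals exactly which band it has converged to.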