Audio events have a hierarchical structure in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST applies one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains at different stages, progressively reducing the number of tokens while increasing the feature dimension. Without external training data, MAST significantly outperforms AST~\cite{gong2021ast} in top-1 accuracy by 22.2\%, 4.4\%, and 4.7\% on Kinetics-Sounds, Epic-Kitchens-100, and VGGSound, respectively. On the downloaded AudioSet dataset, in which over 20\% of the audio clips are missing, MAST also achieves slightly higher accuracy than AST. In addition, MAST is 5$\times$ more efficient in terms of multiply-accumulate operations (MACs), with a 42\% reduction in the number of parameters compared to AST. Through clustering metrics and visualizations, we demonstrate that MAST learns semantically more separable feature representations from audio signals.
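To make the stage-wise pooling concrete, the following minimal PyTorch sketch (our illustration under stated assumptions, not the authors' released MAST code; the module name \texttt{PoolingStageTransition} and all hyperparameters are hypothetical) shows how pooling over the time-frequency token grid between Transformer stages reduces the token count while a linear projection widens the feature dimension. Setting \texttt{pool\_f=1} recovers the one-dimensional, time-only variant.

\begin{verbatim}
import torch
import torch.nn as nn

class PoolingStageTransition(nn.Module):
    """Sketch of a MAST-style stage transition: pool the token grid
    (frequency x time) to reduce tokens, then widen the channels.
    Illustrative only; not the authors' implementation."""
    def __init__(self, in_dim, out_dim, pool_t=2, pool_f=2):
        super().__init__()
        # 2-D pooling over the (frequency, time) token grid;
        # pool_f=1 gives the 1-D, time-only pooling variant.
        self.pool = nn.MaxPool2d(kernel_size=(pool_f, pool_t))
        self.proj = nn.Linear(in_dim, out_dim)  # increase feature dim

    def forward(self, x, f_tokens, t_tokens):
        # x: (batch, f_tokens * t_tokens, in_dim) patch-token sequence
        b, n, d = x.shape
        grid = x.transpose(1, 2).reshape(b, d, f_tokens, t_tokens)
        grid = self.pool(grid)                # fewer tokens per stage
        f_new, t_new = grid.shape[2], grid.shape[3]
        x = grid.flatten(2).transpose(1, 2)   # back to a token sequence
        return self.proj(x), f_new, t_new     # wider features

# Example: an 8 x 64 token grid from a patchified spectrogram.
stage = PoolingStageTransition(in_dim=96, out_dim=192)
tokens = torch.randn(2, 8 * 64, 96)
out, f, t = stage(tokens, f_tokens=8, t_tokens=64)
print(out.shape, f, t)  # torch.Size([2, 128, 192]) 4 32
\end{verbatim}

Stacking several such transitions between attention stages yields the progressive trade of sequence length for feature width described above, which is where the MAC savings over a fixed-resolution model such as AST come from.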