This paper proposes a Region-based Convolutional Recurrent Neural Network (R-CRNN) for audio event detection (AED). The proposed network is inspired by Faster-RCNN, a well-known region-based convolutional network framework for visual object detection. Unlike the original Faster-RCNN, a recurrent layer is added on top of the convolutional network to capture long-term temporal context from the extracted high-level features. Most previous work on AED generates predictions at the frame level first and then uses post-processing to infer the onset/offset timestamps of events from a probability sequence; in contrast, the proposed method generates predictions directly at the event level and can be trained end-to-end with a multitask loss that optimizes the classification and localization of audio events simultaneously. The proposed method is evaluated on the DCASE 2017 Challenge dataset. To the best of our knowledge, R-CRNN is the best-performing single-model method among all methods that do not use ensembles, on both the development and evaluation sets. Compared to the other region-based network for AED (R-FCN), which has an event-based error rate (ER) of 0.18 on the development set, our method reduces the ER by half.
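To make the described architecture concrete, the following is a minimal PyTorch sketch of an R-CRNN-style detector: a CNN front end over a spectrogram, a recurrent (GRU) layer on top for long-term temporal context, and joint classification and localization heads trained with a multitask loss. The input representation (log-mel spectrogram), layer sizes, the per-time-step localization simplification (in place of full region proposals), and the loss weight `alpha` are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an R-CRNN-style detector (illustrative assumptions, not
# the paper's exact architecture): CNN front end over a log-mel spectrogram,
# GRU for long-term temporal context, and two heads that classify and
# localize events, optimized jointly with a multitask loss.
import torch
import torch.nn as nn


class RCRNNSketch(nn.Module):
    def __init__(self, n_mels=64, n_classes=17, hidden=128):
        super().__init__()
        # Convolutional feature extractor: pool only along frequency so the
        # time resolution of the feature map is preserved for localization.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        # Recurrent layer on top of the CNN to capture long-term context
        # from the extracted high-level features.
        self.rnn = nn.GRU(64 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        # Event-level heads: class scores and (onset, offset) offsets per
        # time step -- a simplified stand-in for region proposals.
        self.cls_head = nn.Linear(2 * hidden, n_classes + 1)  # +1 background
        self.loc_head = nn.Linear(2 * hidden, 2)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, n_frames)
        f = self.cnn(spec)                          # (B, C, F', T)
        b, c, fq, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * fq)
        h, _ = self.rnn(f)                          # (B, T, 2 * hidden)
        return self.cls_head(h), self.loc_head(h)


def multitask_loss(cls_logits, loc_pred, cls_target, loc_target, alpha=1.0):
    # Multitask loss: classification and localization optimized jointly,
    # mirroring the end-to-end training objective described above.
    cls_loss = nn.functional.cross_entropy(
        cls_logits.flatten(0, 1), cls_target.flatten())
    loc_loss = nn.functional.smooth_l1_loss(loc_pred, loc_target)
    return cls_loss + alpha * loc_loss
```

Pooling only along the frequency axis is a deliberate choice in this sketch: it keeps one feature vector per input frame, so the localization head can predict event boundaries at the original time resolution.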