The performance of Sound Event Detection (SED) systems is greatly limited by the difficulty of generating large strongly labeled datasets. In this work, we use two main approaches to overcome the lack of strongly labeled data. First, we apply heavy data augmentation to the input features. The augmentation methods include not only conventional methods used in the speech and audio domains but also our proposed method, FilterAugment. Second, we propose two methods that utilize weak predictions to enhance weakly supervised SED performance. As a result, we obtained a best PSDS1 of 0.4336 and a best PSDS2 of 0.8161 on the DESED real validation dataset. This work was submitted to DCASE 2021 Task 4 and ranked 3rd.
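The abstract does not spell out how FilterAugment operates, so the following is only a minimal sketch of one plausible reading: a random gain applied per frequency band of a log-mel spectrogram. The function name `band_gain_augment`, the band count range, and the dB range are all hypothetical choices made for illustration and are not taken from the text.

```python
import random


def band_gain_augment(log_mel, n_bands=(3, 6), db_range=(-6.0, 6.0), rng=None):
    """Add a random gain (in dB) to each of several random frequency bands.

    log_mel: list of frames, each a list of per-bin log-magnitudes in dB.
    Parameter names and default values are illustrative assumptions only.
    """
    rng = rng or random.Random()
    n_bins = len(log_mel[0])
    # Number of bands, capped so every band holds at least one bin.
    k = min(rng.randint(n_bands[0], n_bands[1]), n_bins)
    # Pick k - 1 distinct interior boundaries, then add the two outer edges.
    cuts = sorted(rng.sample(range(1, n_bins), k - 1))
    edges = [0] + cuts + [n_bins]
    # One gain per band, shared by every frame in the clip.
    gains = [0.0] * n_bins
    for lo, hi in zip(edges[:-1], edges[1:]):
        g = rng.uniform(db_range[0], db_range[1])
        for b in range(lo, hi):
            gains[b] = g
    # Return a new spectrogram; the input is left unmodified.
    return [[v + gains[b] for b, v in enumerate(frame)] for frame in log_mel]


# Usage: augment a dummy 10-frame, 64-bin spectrogram with a fixed seed.
spec = [[0.0] * 64 for _ in range(10)]
augmented = band_gain_augment(spec, rng=random.Random(0))
```

Because the gain is added in the log domain, it corresponds to scaling each band's energy, which loosely imitates recording the same clip through a different acoustic filter. Strong and weak labels are unaffected, since the transform does not change event timing.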