The Speech Emotion Recognition (SER) task has seen significant improvements over the last years with the advent of Deep Neural Networks (DNNs). However, even the most successful methods still struggle when adaptation to specific speakers and scenarios is needed, inevitably leading to poorer performance compared to humans. In this paper, we present novel work based on the idea of teaching the emotion recognition network about speaker identity. Our system is a combination of two ACRNN classifiers dedicated respectively to speaker and emotion recognition. The former informs the latter through a Self Speaker Attention (SSA) mechanism that is shown to considerably help focus on the emotional information of the speech signal. Experiments on the social attitudes database Att-HACK and the IEMOCAP corpus demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance in terms of unweighted average recall.
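The abstract does not specify the SSA mechanism in detail; as a rough, hypothetical sketch of the general idea, the following shows one way a speaker embedding could be used as an attention query over frame-level emotion features so the emotion branch is conditioned on speaker identity. All function names, shapes, and the scaled dot-product formulation are our assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_speaker_attention(emotion_feats, speaker_emb):
    """Hypothetical sketch: weight emotion frames by similarity
    to a fixed speaker embedding, then pool.

    emotion_feats: (T, D) frame-level features from the emotion branch
    speaker_emb:   (D,)   utterance-level embedding from the speaker branch
    returns:       (D,)   speaker-conditioned emotion representation
    """
    d = emotion_feats.shape[-1]
    scores = emotion_feats @ speaker_emb / np.sqrt(d)  # (T,)
    weights = softmax(scores)                          # sums to 1 over frames
    return weights @ emotion_feats                     # weighted pooling

rng = np.random.default_rng(0)
T, D = 50, 64
feats = rng.standard_normal((T, D))   # stand-in for ACRNN emotion features
spk = rng.standard_normal(D)          # stand-in for speaker embedding
ctx = self_speaker_attention(feats, spk)
print(ctx.shape)  # (64,)
```

The pooled vector `ctx` would then feed the emotion classification head; in the actual system both branches are ACRNNs and the attention is learned jointly, which this toy example does not capture.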