Speech emotion recognition is a crucial problem manifesting in a multitude of applications such as human-computer interaction and education. Although several advancements have been made in recent years, especially with the advent of Deep Neural Networks (DNNs), most studies in the literature fail to consider the semantic information in the speech signal. In this paper, we propose a novel framework that can capture both the semantic and the paralinguistic information in the signal. In particular, our framework comprises a semantic feature extractor, which captures the semantic information, and a paralinguistic feature extractor, which captures the paralinguistic information. The semantic and paralinguistic features are then combined into a unified representation using a novel attention mechanism. The unified feature vector is passed through an LSTM to capture the temporal dynamics in the signal before the final prediction. To validate the effectiveness of our framework, we use the popular SEWA dataset of the AVEC challenge series and compare against the three winning papers. Our model provides state-of-the-art results in the valence and liking dimensions.
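The pipeline the abstract describes (two feature streams fused by attention, then an LSTM over time) can be sketched in a few lines. This is a minimal NumPy illustration under assumed dimensions, not the paper's actual architecture: the scoring vector `v`, the feature size `H`, and the single-layer LSTM are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H = 16  # common feature dimension (illustrative)

def attention_fuse(sem, para, v):
    """Hypothetical attention fusion: one scalar score per stream and
    time step, softmax-normalised across the two streams, weighting a
    convex combination of the projected features."""
    # sem, para: (T, H); v: (H,) scoring vector
    scores = np.stack([sem @ v, para @ v], axis=-1)  # (T, 2)
    w = softmax(scores, axis=-1)                     # weights sum to 1 per step
    return w[:, :1] * sem + w[:, 1:] * para          # (T, H)

def lstm(x, Wx, Wh):
    """Minimal single-layer LSTM over the fused sequence (biases omitted)."""
    T, H = x.shape
    h, c = np.zeros(H), np.zeros(H)
    for t in range(T):
        z = x[t] @ Wx + h @ Wh                       # (4H,) gate pre-activations
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
    return h  # last hidden state feeds the final prediction head

sem  = rng.standard_normal((10, H))  # semantic features, T = 10 frames
para = rng.standard_normal((10, H))  # paralinguistic features
v    = rng.standard_normal(H)
fused = attention_fuse(sem, para, v)
h = lstm(fused,
         rng.standard_normal((H, 4 * H)) * 0.1,
         rng.standard_normal((H, 4 * H)) * 0.1)
print(fused.shape, h.shape)  # (10, 16) (16,)
```

In a trained system the per-step attention weights let the model lean on the semantic stream when the words carry the emotion and on the paralinguistic stream when the tone does.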