Automatic emotion recognition (AER) is a challenging task because emotion is an abstract concept with multiple modes of expression. Although there is no consensus on its definition, human emotional states can usually be perceived through the auditory and visual systems. Inspired by this cognitive process in human beings, it is natural to exploit audio and visual information simultaneously in AER. However, most traditional fusion approaches only build a linear paradigm, such as feature concatenation and multi-system fusion, which can hardly capture the complex associations between audio and video. In this paper, we introduce factorized bilinear pooling (FBP) to deeply integrate audio and video features. Specifically, the features of each modality are selected through an embedded attention mechanism to obtain the emotion-related regions. The whole pipeline can be completed in a neural network. Validated on the AFEW database of the audio-video sub-challenge in EmotiW2018, the proposed approach achieves an accuracy of 62.48%, outperforming the state-of-the-art result.
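As a rough illustration of the fusion operation named above, the following is a minimal NumPy sketch of factorized bilinear pooling in the common low-rank form: the audio and video vectors are each linearly projected, combined by an elementwise product, sum-pooled over the rank dimension, then power- and l2-normalized. All dimensions, variable names, and the normalization details are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not taken from the paper):
d_a, d_v = 128, 256   # audio / video feature dimensions
k, o = 4, 32          # factor rank k, fused output dimension o

# Low-rank projection matrices for each modality
U = rng.standard_normal((d_a, k * o)) * 0.01  # audio projection
V = rng.standard_normal((d_v, k * o)) * 0.01  # video projection

def fbp(x_audio, y_video):
    """Factorized bilinear pooling: elementwise product of the two
    projected modalities, sum-pooled over the rank-k factor dimension,
    followed by signed square-root and l2 normalization."""
    joint = (x_audio @ U) * (y_video @ V)               # shape (k*o,)
    pooled = joint.reshape(o, k).sum(axis=1)            # sum-pool over k
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))  # power normalization
    return pooled / (np.linalg.norm(pooled) + 1e-8)     # l2 normalization

z = fbp(rng.standard_normal(d_a), rng.standard_normal(d_v))
print(z.shape)  # fused representation of dimension o
```

The elementwise product after projection is what gives the pooling its bilinear (multiplicative) interaction between modalities, while the rank-k factorization keeps the parameter count far below that of a full bilinear map.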