With the rapid development of artificial intelligence and short videos, emotion recognition in short videos has become one of the most important research topics in human-computer interaction. At present, most emotion recognition methods still rely on a single modality. However, in daily life people often disguise their real emotions, so the accuracy of single-modality emotion recognition is relatively poor. Moreover, similar emotions are hard to distinguish. Therefore, we propose a new approach, denoted ICANet, for multimodal short-video emotion recognition. It employs three modalities (audio, video, and optical flow), compensating for the limitations of any single modality and thereby improving the accuracy of emotion recognition in short videos. ICANet achieves an accuracy of 80.77% on the IEMOCAP benchmark, exceeding the SOTA methods by 15.89%.