Most recent works on sentiment analysis have exploited the text modality. However, millions of hours of video recordings posted on social media platforms every day hold vital unstructured information that can be exploited to gauge public perception more effectively. Multimodal sentiment analysis offers an innovative solution for computationally understanding and harvesting sentiment from videos by contextually exploiting audio, visual, and textual cues. In this paper, we first present a first-of-its-kind Persian multimodal dataset comprising more than 800 utterances, as a benchmark resource for researchers to evaluate multimodal sentiment analysis approaches in the Persian language. Second, we present a novel context-aware multimodal sentiment analysis framework that simultaneously exploits acoustic, visual, and textual cues to determine the expressed sentiment more accurately. We employ both decision-level (late) and feature-level (early) fusion methods to integrate affective cross-modal information. Experimental results demonstrate that the contextual integration of multimodal features (textual, acoustic, and visual) delivers better performance (91.39%) than unimodal features (89.24%).
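The two fusion strategies named above can be contrasted with a minimal sketch. The snippet below is illustrative only: the feature dimensions, the logistic-regression classifier, and the probability-averaging rule for late fusion are assumptions for demonstration, not the paper's actual configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for per-utterance modality features and sentiment labels.
rng = np.random.default_rng(0)
n = 200
text_feats = rng.normal(size=(n, 50))    # e.g. textual embeddings (assumed dim)
audio_feats = rng.normal(size=(n, 30))   # e.g. acoustic descriptors (assumed dim)
visual_feats = rng.normal(size=(n, 40))  # e.g. visual/facial features (assumed dim)
labels = rng.integers(0, 2, size=n)      # binary sentiment labels

# Feature-level (early) fusion: concatenate all modality features,
# then train a single classifier on the joint representation.
early_X = np.hstack([text_feats, audio_feats, visual_feats])
early_clf = LogisticRegression(max_iter=1000).fit(early_X, labels)
early_pred = early_clf.predict(early_X)

# Decision-level (late) fusion: train one classifier per modality,
# then combine their predicted probabilities (here, a simple average).
clf_text = LogisticRegression(max_iter=1000).fit(text_feats, labels)
clf_audio = LogisticRegression(max_iter=1000).fit(audio_feats, labels)
clf_visual = LogisticRegression(max_iter=1000).fit(visual_feats, labels)
late_probs = np.mean(
    [clf_text.predict_proba(text_feats),
     clf_audio.predict_proba(audio_feats),
     clf_visual.predict_proba(visual_feats)],
    axis=0,
)
late_pred = late_probs.argmax(axis=1)
```

Early fusion lets the classifier model cross-modal interactions directly, while late fusion keeps each modality's model independent and merges only their decisions.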