As a spontaneous expression of emotion on the face, the micro-expression is receiving increasing attention from the affective computing community. While better recognition accuracy has been achieved by various deep learning (DL) techniques, one characteristic of micro-expressions has not been fully exploited: such facial movement is transient and sparsely localized in time. As a result, the representation learned from a full video clip is usually redundant. On the other hand, methods that rely on the single apex frame require manual annotation and sacrifice temporal dynamics. To simultaneously localize and recognize these fleeting facial movements, we propose a novel end-to-end deep learning architecture, referred to as the Adaptive Key-frame Mining Network (AKMNet). Operating on the raw video clip of a micro-expression, AKMNet learns a discriminative spatio-temporal representation by combining the spatial features of self-learned local key frames with their global temporal dynamics. Empirical and theoretical evaluations show the advantages of the proposed approach with improved performance.