We propose a new approach to video face recognition. Our component-wise feature aggregation network (C-FAN) takes a set of face images of a subject as input and outputs a single feature vector as the face representation of the set for recognition. The network is trained in two steps: (i) train a base CNN for still-image face recognition; (ii) add an aggregation module to the base network that learns a quality value for each feature component and uses these values to adaptively aggregate the deep feature vectors into a single vector representing the face in a video. C-FAN thus learns to retain salient face features with high quality scores while suppressing features with low quality scores. Experimental results on three benchmark datasets, YouTube Faces, IJB-A, and IJB-S, show that C-FAN generates a compact 512-dimensional feature vector for a video sequence by efficiently aggregating the feature vectors of all video frames, and achieves state-of-the-art performance.
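The component-wise aggregation step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the per-component quality values are normalized with a softmax across the images in the set (the abstract does not state the normalization), and the function name `aggregate_components` and its arguments are hypothetical.

```python
import numpy as np


def aggregate_components(features, quality):
    """Aggregate N per-frame feature vectors into one vector, component by component.

    features: array of shape (N, D), one deep feature vector per frame.
    quality:  array of shape (N, D), one quality value per feature component.
    Returns an array of shape (D,).

    Assumption: quality values are turned into weights with a softmax over
    the N frames, independently for each of the D components.
    """
    features = np.asarray(features, dtype=np.float64)
    quality = np.asarray(quality, dtype=np.float64)
    if features.ndim != 2 or features.shape != quality.shape:
        raise ValueError("features and quality must both have shape (N, D)")

    # Subtract the per-component maximum for numerical stability.
    shifted = quality - quality.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)  # each column sums to 1

    # Weighted sum over frames, computed separately for each component.
    return (weights * features).sum(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(8, 512))   # 8 frames, 512-dimensional features
    scores = rng.normal(size=(8, 512))   # stand-in for learned quality values
    print(aggregate_components(frames, scores).shape)
```

With equal quality values the result reduces to plain average pooling; when one frame has a much higher quality value for a component, that frame's value dominates the corresponding component of the output.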