The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions: First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos. This learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations; Second, we propose a curriculum training scheme that predicts further into the future with progressively less temporal context. This encourages the model to encode only slowly varying spatio-temporal signals, therefore leading to semantic representations; Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then finetuning the representation on a downstream task, i.e. action recognition. With a single stream (RGB only), DPC-pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top-1 acc) and HMDB51 (35.7% top-1 acc), outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.
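To make the recurrent future-prediction idea concrete, below is a minimal PyTorch sketch of the prediction loop the abstract describes: encode spatio-temporal blocks, aggregate past blocks into a context, then recurrently predict the features of future blocks and score them with a contrastive loss. All names and sizes (`DPCSketch`, `feature_dim`, `pred_steps`), the linear stand-in encoder, and the plain `GRUCell` aggregator are illustrative assumptions, not the authors' exact implementation (the paper's encoder is a 3D-CNN and its aggregator is convolutional).

```python
# A minimal sketch of dense predictive coding, under the assumptions above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCSketch(nn.Module):
    def __init__(self, feature_dim=256, pred_steps=3):
        super().__init__()
        self.pred_steps = pred_steps
        # f: encodes each spatio-temporal block to a feature vector.
        # A linear layer stands in for the paper's 3D-CNN to keep the
        # sketch self-contained and runnable.
        self.encoder = nn.Linear(feature_dim, feature_dim)
        # g: aggregates past block features into a context vector.
        self.aggregator = nn.GRUCell(feature_dim, feature_dim)
        # phi: predicts the next block's feature from the context.
        self.predictor = nn.Linear(feature_dim, feature_dim)

    def forward(self, blocks):
        # blocks: (batch, num_blocks, feature_dim) pre-extracted block inputs.
        B, T, D = blocks.shape
        feats = self.encoder(blocks)        # z_1 .. z_T
        n_ctx = T - self.pred_steps         # blocks kept as temporal context
        # Aggregate the context blocks into c_{n_ctx}.
        c = feats.new_zeros(B, D)
        for t in range(n_ctx):
            c = self.aggregator(feats[:, t], c)
        # Recurrently predict future block features: each prediction is fed
        # back through the aggregator to produce the next context.
        preds = []
        for _ in range(self.pred_steps):
            z_hat = self.predictor(c)
            preds.append(z_hat)
            c = self.aggregator(z_hat, c)
        preds = torch.stack(preds, dim=1)   # (B, pred_steps, D)
        targets = feats[:, n_ctx:]          # ground-truth future features
        return preds, targets

def contrastive_loss(preds, targets):
    # Noise-contrastive objective: each predicted feature must identify its
    # own target among all (batch x step) targets via dot-product similarity.
    B, S, D = preds.shape
    p = F.normalize(preds.reshape(B * S, D), dim=1)
    t = F.normalize(targets.reshape(B * S, D), dim=1)
    logits = p @ t.t()                      # (B*S, B*S) similarity matrix
    labels = torch.arange(B * S)            # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage with random stand-in data: 8 blocks per clip, predict the last 3.
model = DPCSketch()
preds, targets = model(torch.randn(4, 8, 256))
loss = contrastive_loss(preds, targets)
loss.backward()
```

Under this sketch, the curriculum scheme of the second contribution would amount to gradually increasing `pred_steps` during training, which simultaneously shrinks `n_ctx`, forcing predictions further into the future from less temporal context.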