Contrastive representation learning for videos relies heavily on the availability of millions of unlabelled videos. This is practical for videos available on the web, but acquiring video data at such scale for real-world applications is expensive and laborious. In this paper, we therefore focus on designing video augmentations for self-supervised learning. We first analyze the best strategy for mixing videos to create a new augmented video sample. A question then remains: can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy, STC-mix, i.e. preliminary mixing of videos followed by CMMC across the different modalities of a video, improves the quality of the learned video representations. We conduct thorough experiments on two downstream tasks, action recognition and video retrieval, on two small-scale video datasets: UCF101 and HMDB51. We also demonstrate the effectiveness of STC-mix on the NTU dataset, where domain knowledge is limited. We show that the performance of STC-mix on both downstream tasks is on par with other self-supervised approaches while requiring less training data.
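To make the CMMC idea concrete, the following is a minimal, hypothetical sketch of mixing a spatio-temporal cuboid ("tesseract") from one modality's feature map into another's. The function name, tensor layout (B, C, T, H, W), and Beta-distributed mixing ratio are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def cross_modal_cutmix(feat_a, feat_b, alpha=1.0):
    """Hypothetical sketch of Cross-Modal Manifold Cutmix (CMMC):
    paste a random spatio-temporal cuboid from feat_b (one modality,
    e.g. optical flow features) into feat_a (another modality,
    e.g. RGB features) in feature space.

    feat_a, feat_b: (B, C, T, H, W) intermediate feature maps
    of the same shape. Returns the mixed features and the
    effective mixing ratio.
    """
    assert feat_a.shape == feat_b.shape
    B, C, T, H, W = feat_a.shape

    # Sample a mixing ratio; cuboid volume is proportional to (1 - lam).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    cut = (1.0 - lam) ** (1.0 / 3.0)
    t = max(1, int(T * cut))
    h = max(1, int(H * cut))
    w = max(1, int(W * cut))

    # Random cuboid location within the feature volume.
    t0 = torch.randint(0, T - t + 1, (1,)).item()
    h0 = torch.randint(0, H - h + 1, (1,)).item()
    w0 = torch.randint(0, W - w + 1, (1,)).item()

    mixed = feat_a.clone()
    mixed[:, :, t0:t0 + t, h0:h0 + h, w0:w0 + w] = \
        feat_b[:, :, t0:t0 + t, h0:h0 + h, w0:w0 + w]

    # Effective ratio of feat_a kept, after integer rounding.
    lam_eff = 1.0 - (t * h * w) / (T * H * W)
    return mixed, lam_eff
```

In a full pipeline, the mixed features would continue through the remaining backbone layers, and the contrastive objective would weight the two source clips by the mixing ratio, analogous to image-space CutMix.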