Deep video recognition is more computationally expensive than image recognition, especially on large-scale datasets such as Kinetics [1]. Training scalability is therefore essential for handling a large number of videos. In this paper, we study the factors that affect the training scalability of video networks. We identify three bottlenecks: data loading (data movement from disk to GPU), communication (data movement over the network), and computation (FLOPs). We propose three design guidelines to improve scalability: (1) fewer FLOPs and hardware-friendly operators to increase computation efficiency; (2) fewer input frames to reduce data movement and increase data loading efficiency; (3) a smaller model size to reduce network traffic and increase networking efficiency. Following these guidelines, we design a new operator, the Temporal Shift Module (TSM), that is efficient and scalable for distributed training. The TSM model achieves 1.8x higher throughput than previous I3D models. We scale up the training of the TSM model to 1,536 GPUs, with a mini-batch of 12,288 video clips (98,304 images), without losing accuracy. With this hardware-aware model design, we scale up training on the Summit supercomputer and reduce the training time on the Kinetics dataset from 49 hours 55 minutes to 14 minutes 13 seconds, achieving a top-1 accuracy of 74.0%, which is 1.6x and 2.9x faster than previous 3D video models while achieving higher accuracy. The code and more details can be found here: http://tsm-hanlab.mit.edu.
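The batch and timing figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract. It assumes the mini-batch is split evenly across GPUs, which the abstract does not state explicitly, and the variable and function names are illustrative, not taken from the released code.

```python
# Figures quoted in the abstract.
NUM_GPUS = 1536
CLIPS_PER_BATCH = 12288
IMAGES_PER_BATCH = 98304


def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a wall-clock duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Assumption: clips are distributed evenly across all GPUs.
clips_per_gpu = CLIPS_PER_BATCH // NUM_GPUS

# Each clip contributes a fixed number of frames (images) to the batch.
frames_per_clip = IMAGES_PER_BATCH // CLIPS_PER_BATCH

# Ratio of the two training times reported for Kinetics.
baseline_s = to_seconds(hours=49, minutes=55)
scaled_s = to_seconds(minutes=14, seconds=13)
speedup = baseline_s / scaled_s

print(f"clips per GPU:   {clips_per_gpu}")
print(f"frames per clip: {frames_per_clip}")
print(f"wall-clock reduction: {speedup:.1f}x")
```

Under these assumptions, each GPU processes 8 clips of 8 frames per step, and the reported training time shrinks by a factor of roughly 210. This is only the ratio of the two wall-clock times; the abstract does not say what hardware the 49 hours 55 minutes baseline ran on, so the ratio is not a scaling-efficiency figure.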