This letter describes a network that captures spatiotemporal correlations over arbitrary timestamps. The proposed scheme operates as a complementary network extended over spatiotemporal regions. Recently, multimodal fusion has been extensively researched in deep learning. For action recognition, the spatial and temporal streams are vital components of deep convolutional neural networks (CNNs), but reducing overfitting and fusing these two streams remain open problems. A common existing fusion approach is simply to average the two streams. To this end, we propose a correlation network with Shannon fusion, trained on top of a pretrained CNN. Long-range video may contain spatiotemporal correlations over arbitrary time spans. These correlations can be captured using simple fully connected layers, which form the correlation network. This is found to be complementary to existing network fusion methods. We evaluate our approach on the UCF-101 and HMDB-51 datasets, and the resulting improvement in accuracy demonstrates the importance of multimodal correlation.
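The two ingredients named above can be illustrated with a minimal sketch. The fusion rule shown here is an assumption: it weights each stream's class distribution inversely to its Shannon entropy (a more confident stream gets more weight), which is one plausible reading of "Shannon fusion"; the correlation head is likewise a generic two-layer fully connected network over concatenated spatial and temporal features, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    # Shannon entropy of each row of a batch of class distributions
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def shannon_fusion(spatial_logits, temporal_logits):
    """Entropy-weighted fusion (assumed rule, not the paper's exact one):
    the lower-entropy (more confident) stream receives the larger weight,
    in contrast to the plain averaging used by existing fusion."""
    ps, pt = softmax(spatial_logits), softmax(temporal_logits)
    hs, ht = entropy(ps), entropy(pt)
    ws = ht / (hs + ht + 1e-12)          # low entropy -> high weight
    wt = hs / (hs + ht + 1e-12)
    return ws[:, None] * ps + wt[:, None] * pt

class CorrelationHead:
    """Simple fully connected layers over concatenated spatial and
    temporal features, modelling cross-stream correlations."""
    def __init__(self, feat_dim, hidden, n_classes):
        self.W1 = rng.standard_normal((2 * feat_dim, hidden)) * 0.01
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, n_classes)) * 0.01
        self.b2 = np.zeros(n_classes)

    def forward(self, spatial_feat, temporal_feat):
        x = np.concatenate([spatial_feat, temporal_feat], axis=-1)
        h = np.maximum(x @ self.W1 + self.b1, 0.0)   # ReLU
        return h @ self.W2 + self.b2                 # class logits

# toy usage: batch of 4 clips, 128-d features per stream, 10 classes
head = CorrelationHead(feat_dim=128, hidden=64, n_classes=10)
corr_logits = head.forward(rng.standard_normal((4, 128)),
                           rng.standard_normal((4, 128)))
fused = shannon_fusion(corr_logits, rng.standard_normal((4, 10)))
print(fused.shape)                            # (4, 10)
print(np.allclose(fused.sum(axis=-1), 1.0))   # rows are distributions
```

Because the fused output is a convex combination of two softmax distributions, each row remains a valid class distribution, so the fusion can drop in wherever plain stream averaging was used.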