Interactive autonomous applications require a perception engine that is robust to artifacts in unconstrained videos. In this paper, we examine the effect of camera motion on the task of action detection. We develop a novel ranking method that orders videos by the degree of global camera motion, and we show that action detection accuracy decreases on the videos ranked highest in camera motion. We propose an action detection pipeline that is robust to the camera-motion effect and verify it empirically. Specifically, we align actor features across frames and couple global scene features with local actor-specific features. Feature alignment uses a novel formulation of the Spatio-temporal Sampling Network (STSN), extended with multi-scale offset prediction and refinement via a pyramid structure. We also propose a novel input-dependent weighted averaging strategy for fusing local and global features. On our dataset of moving-camera videos with high camera motion (the MOVE dataset), our network improves frame mAP by 4.1% and video mAP by 17%.
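The abstract does not specify how the global camera-motion score is computed; a minimal sketch of one plausible realization is to average the magnitude of background motion vectors (e.g., from dense optical flow) over a video and sort videos by that score. The function names and the `(dx, dy)` vector representation below are illustrative assumptions, not the paper's actual formulation.

```python
def camera_motion_score(flows):
    """Mean magnitude of background motion vectors over a whole video.

    flows: list of frames, each frame a list of (dx, dy) motion vectors
    (assumed here to come from a background optical-flow estimate).
    """
    mags = [(dx * dx + dy * dy) ** 0.5 for frame in flows for dx, dy in frame]
    return sum(mags) / len(mags)


def rank_videos(videos):
    """Rank video names by descending global camera motion.

    videos: dict mapping video name -> per-frame motion vectors.
    """
    return sorted(videos, key=lambda v: camera_motion_score(videos[v]), reverse=True)
```

Videos at the top of this ranking would then form the high-camera-motion subset on which detection accuracy is evaluated.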
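The input-dependent weighted averaging for fusing local and global features can be sketched as follows. This is an assumed minimal form: scalar gates (which in practice would be predicted from the input by a small network) are softmax-normalized so the fused feature is a convex combination of the local and global features. The function name `fuse` and the gate parameterization are illustrative, not the paper's exact design.

```python
import math


def fuse(local_feat, global_feat, w_local, w_global):
    """Convex combination of local and global feature vectors.

    w_local / w_global: scalar gates, assumed to be predicted per input;
    softmax normalization makes the two weights sum to 1.
    """
    m = max(w_local, w_global)  # subtract max for numerical stability
    e_l = math.exp(w_local - m)
    e_g = math.exp(w_global - m)
    a_l = e_l / (e_l + e_g)
    a_g = e_g / (e_l + e_g)
    return [a_l * l + a_g * g for l, g in zip(local_feat, global_feat)]
```

With equal gates the two features are averaged; as one gate grows, the fused output smoothly shifts toward that feature, letting the network weight actor-specific evidence more heavily when global scene context is unreliable (e.g., under heavy camera motion).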