Forecasting future human actions from partial observations of activities is an important problem in many practical applications such as assistive robotics, video surveillance, and security. We present a method to forecast actions for the unseen future of a video using a neural machine translation technique based on an encoder-decoder architecture. The input to the model is the observed RGB video, and the target is to generate the future symbolic action sequence. Unlike most methods, which make frame- or clip-level predictions for some unseen percentage of the video, we predict the complete action sequence required to accomplish the activity. To account for two types of uncertainty in future predictions, we propose a novel loss function. We show that a combination of optimal transport and future-uncertainty losses helps to boost results. We evaluate our model on three challenging video datasets (Charades, MPII Cooking, and Breakfast). We outperform other state-of-the-art techniques on the frame-based action forecasting task by 5.06\% on average across several action forecasting setups.