Understanding the behaviors and intentions of humans is one of the main challenges autonomous ground vehicles still face. In particular, in complex environments such as urban traffic scenes, inferring the intentions and actions of vulnerable road users such as pedestrians becomes even harder. In this paper, we address the problem of predicting the intended actions of pedestrians in urban traffic environments using only image sequences from a monocular RGB camera. We propose a real-time framework that can accurately detect, track, and predict the intended actions of pedestrians, based on a tracking-by-detection technique in conjunction with a novel spatio-temporal DenseNet model. We trained and evaluated our framework on real data collected from urban traffic environments. Our framework showed resilient and competitive results in comparison with other baseline approaches. Overall, we achieved an average precision score of 84.76% with real-time performance at 20 FPS.