Given a video of a person in action, we can easily guess the 3D future motion of the person. In this work, we present perhaps the first approach for predicting a future 3D mesh model sequence of a person from past video input. We do this for periodic motions such as walking and also for actions like bowling and squatting seen in sports or workout videos. While there has been a surge of interest in future prediction in computer vision, most approaches predict the 3D future from 3D past inputs or the 2D future from 2D past inputs. In this work, we focus on the problem of predicting 3D future motion from past image sequences, which has a plethora of practical applications in autonomous systems that must operate safely around people from visual inputs. Inspired by the success of autoregressive models in language modeling tasks, we learn an intermediate latent space in which we predict the future. This effectively facilitates autoregressive predictions when the input domain differs from the output domain. Our approach can be trained on video sequences obtained in-the-wild without 3D ground truth labels. The project website with videos can be found at https://jasonyzhang.com/phd.
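To make the core idea concrete, here is a minimal sketch of autoregressive rollout in a learned latent space: past image features are encoded into latents, the future is predicted latent-to-latent, and each predicted latent is decoded into 3D mesh parameters. All dimensions and the linear stand-ins for the learned encoder, predictor, and decoder are illustrative assumptions, not the paper's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper):
# per-frame image features -> shared latent -> 3D mesh parameters.
FEAT_DIM, LATENT_DIM, MESH_DIM = 16, 8, 24

# Random linear maps standing in for the learned encoder,
# autoregressive predictor, and mesh decoder.
W_enc = rng.standard_normal((LATENT_DIM, FEAT_DIM)) * 0.1
W_ar = rng.standard_normal((LATENT_DIM, LATENT_DIM)) * 0.1
W_dec = rng.standard_normal((MESH_DIM, LATENT_DIM)) * 0.1

def encode(frame_feat):
    # Map a past image feature into the intermediate latent space.
    return W_enc @ frame_feat

def predict_next(latent):
    # Autoregressive step: the next latent conditioned on the current one.
    return W_ar @ latent

def decode(latent):
    # Decode a latent into 3D mesh parameters (e.g. pose and shape).
    return W_dec @ latent

# Encode the most recent observed frame, then roll out future steps
# entirely in latent space, decoding each step to a mesh prediction.
past_feats = rng.standard_normal((5, FEAT_DIM))
z = encode(past_feats[-1])
future_meshes = []
for _ in range(3):
    z = predict_next(z)              # predict in latent space...
    future_meshes.append(decode(z))  # ...then decode to 3D mesh params

print(len(future_meshes), future_meshes[0].shape)
```

Because prediction happens in the latent space rather than on raw pixels or meshes, the same rollout loop works even though the inputs (2D video frames) and outputs (3D meshes) live in different domains.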