This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones. Individual images are decomposed into multiple layers combining object masks and a small set of control points. The layer structure is shared across all frames in each video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers, and video synthesis is broken down into discovering the layers associated with past frames, predicting the corresponding transformations for upcoming ones, warping the associated object regions accordingly, and filling in the remaining image parts. Extensive experiments on multiple benchmarks, including urban videos (Cityscapes and KITTI) and videos featuring nonrigid motions (UCF-Sports and H3.6M), show that our method consistently outperforms the state of the art by a significant margin in every case. Code, pretrained models, and video samples synthesized by our approach can be found on the project webpage https://16lemoing.github.io/waldo.
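The core warping step described above can be sketched minimally: for each layer, fit a parametric geometric transformation (here a least-squares affine map, a simple stand-in for the learned transformations in the paper) from control-point correspondences, then inverse-warp that layer's masked pixels into the future frame. The function names `fit_affine` and `warp_layer` are illustrative, not part of the released code; the actual pipeline is learned end to end.

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine map sending src control points to dst."""
    n = len(src_pts)
    A = np.hstack([src_pts, np.ones((n, 1))])        # (n, 3) homogeneous points
    M, *_ = np.linalg.lstsq(A, dst_pts, rcond=None)  # solve A @ M = dst
    return M.T                                       # (2, 3) affine matrix

def warp_layer(image, mask, M):
    """Inverse-warp the masked layer of `image` by affine M (nearest neighbour)."""
    H, W = image.shape[:2]
    out_img = np.zeros_like(image)
    out_mask = np.zeros_like(mask)
    ys, xs = np.mgrid[0:H, 0:W]
    xs, ys = xs.ravel(), ys.ravel()
    pts = np.stack([xs, ys, np.ones_like(xs)]).astype(float)  # (3, H*W)
    A3 = np.vstack([M, [0.0, 0.0, 1.0]])
    src = np.linalg.inv(A3)[:2] @ pts        # source coords for each output pixel
    sx, sy = np.round(src).astype(int)
    valid = (0 <= sx) & (sx < W) & (0 <= sy) & (sy < H)
    sx, sy, dx, dy = sx[valid], sy[valid], xs[valid], ys[valid]
    keep = mask[sy, sx]                      # copy only pixels inside the layer
    out_img[dy[keep], dx[keep]] = image[sy[keep], sx[keep]]
    out_mask[dy[keep], dx[keep]] = True
    return out_img, out_mask
```

In this toy setting the per-layer outputs would then be composited front to back, with an inpainting step filling the regions no layer covers, mirroring the decomposition described in the abstract.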