WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction


Abstract

This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones. Individual images are decomposed into multiple layers combining object masks and a small set of control points. The layer structure is shared across all frames in each video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers, and video synthesis is broken down into discovering the layers associated with past frames, predicting the corresponding transformations for upcoming ones, warping the associated object regions accordingly, and filling in the remaining image parts. Extensive experiments on multiple benchmarks, including urban videos (Cityscapes and KITTI) and videos featuring non-rigid motions (UCF-Sports and H3.6M), show that our method consistently outperforms the state of the art by a significant margin in every case. Code, pretrained models, and video samples synthesized by our approach can be found on the project webpage.
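The motion model described above combines per-layer parametric transformations, weighted by the layer masks, into one dense flow field. The sketch below illustrates this idea in a minimal form; it is not the paper's implementation, and the use of plain affine matrices is a hypothetical stand-in for WALDO's control-point parametrization.

```python
import numpy as np

def layer_flow(masks, thetas, H, W):
    """Combine per-layer affine motions into a dense flow field.

    masks:  (L, H, W) soft layer masks summing to 1 at each pixel.
    thetas: (L, 2, 3) affine transform per layer (a simplified,
            hypothetical stand-in for control-point-based warps).
    Returns an (H, W, 2) flow field in pixel units (dx, dy).
    """
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3)
    flow = np.zeros((H, W, 2))
    for mask, theta in zip(masks, thetas):
        warped = coords @ theta.T  # (H, W, 2) transformed positions
        flow += mask[..., None] * (warped - coords[..., :2])
    return flow

# Example: static background layer plus a foreground layer
# translated by 2 pixels to the right.
H, W = 8, 8
masks = np.zeros((2, H, W))
masks[1, 2:6, 2:6] = 1.0          # foreground object
masks[0] = 1.0 - masks[1]         # background
thetas = np.stack([
    np.eye(2, 3),                              # identity motion
    np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]),  # +2 px in x
])
flow = layer_flow(masks, thetas, H, W)
```

Because the masks partition the image, each pixel simply inherits the parametric motion of the layer it belongs to, which is what makes complex scene motion expressible with only a few parameters per layer.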



Citation

@inproceedings{lemoing2022waldo,
  title     = {{WALDO}: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction},
  author    = {Guillaume Le Moing and Jean Ponce and Cordelia Schmid},
  booktitle = {ICCV},
  year      = {2023}
}

Results

Layer video decomposition

Layer masks and control points automatically extracted by WALDO from the videos

Future layer prediction

Motion vectors between predicted control points and past ones, overlaid on the ground-truth video

Warping, fusion and inpainting

Actual synthesis of future frames, which consists of warping the input frames and filling in the empty image parts
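The warping-and-filling step described in this caption can be sketched as a backward warp with a validity mask: pixels whose source location falls outside the known frame are "holes" that the fusion/inpainting stage must fill. The snippet below is a minimal, hypothetical illustration (nearest-neighbor sampling, constant fill), not the network-based inpainting used by WALDO.

```python
import numpy as np

def warp_and_fill(frame, flow, fill_value=0.0):
    """Backward-warp a grayscale frame with a dense flow field.

    frame: (H, W) image; flow: (H, W, 2) per-pixel (dx, dy) motion.
    Returns the warped frame and a boolean validity mask; invalid
    pixels (sources outside the frame) are set to fill_value, standing
    in for the inpainting step that fills the empty image parts.
    """
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.round(xs - flow[..., 0]).astype(int)
    src_y = np.round(ys - flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < W) & (src_y >= 0) & (src_y < H)
    out = np.full(frame.shape, fill_value, dtype=float)
    out[valid] = frame[src_y[valid], src_x[valid]]
    return out, valid

# Example: shift a frame 1 pixel to the right; the leftmost column
# has no source pixel and is left for inpainting.
frame = np.arange(16, dtype=float).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
warped, valid = warp_and_fill(frame, flow)
```

Backward warping (sampling each output pixel from its source) avoids the holes and collisions that forward splatting produces, which is why the remaining empty regions reduce to out-of-frame and disoccluded areas.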

WALDO

Combining all modules

Short-term prediction

Comparison to OMP (CVPR'20), VPCL (CVPR'22) and VPVFI (CVPR'22)

10 future frames predicted from 4 past ones

Long-term prediction

Comparison to SLAMP (ICCV'21)

50 future frames predicted from 4 past ones

Non-rigid prediction

Comparison to STRPM (CVPR'22)

10 future frames predicted from 4 past ones

Ablation of our approach


Acknowledgments

We would like to thank Daniel Geng and Xinzhu Bei for clarifications on the evaluation process, and we are grateful to Pauline Sert for helpful feedback. This work was granted access to the HPC resources of IDRIS under the allocation 2021-AD011012227R1 made by GENCI. It was funded in part by the French government under the management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), and the ANR project VideoPredict, reference ANR-21-FAI1-0002-01. JP was supported in part by the Louis Vuitton/ENS chair in artificial intelligence and the Inria/NYU collaboration.


Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.


Copyright © Guillaume Le Moing