CCVS: Context-aware Controllable Video Synthesis

NeurIPS 2021

[Figure: CCVS framework overview]

Abstract

This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several key new elements for improved spatial resolution and realism: it conditions the synthesis process on contextual information for temporal continuity and on ancillary information for fine control. The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in the appearance and temporal domains further improves the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse, inserted between the transformer and the decoder) adds even more flexibility: it affords simple mechanisms for handling multimodal ancillary information controlling the synthesis process (e.g., a few sample frames, an audio track, a trajectory in image space), and it accounts for the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.
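To make the pipeline described above concrete, here is a minimal PyTorch-style sketch of the doubly-autoregressive synthesis loop: frames are encoded and quantized into discrete tokens, future tokens are sampled autoregressively by a transformer (optionally prepended with control tokens derived from ancillary information such as audio or a trajectory), then de-quantized and decoded back to image space, where each new frame extends the context. All module names and interfaces below (encoder, quantizer, transformer, decoder, and their encode/decode/sample methods) are assumptions made for illustration, not the authors' released implementation.

import torch
import torch.nn as nn


class CCVSSketch(nn.Module):
    """Hypothetical wrapper around four placeholder modules (not the released code)."""

    def __init__(self, encoder, quantizer, transformer, decoder):
        super().__init__()
        self.encoder = encoder          # frame -> continuous latent grid
        self.quantizer = quantizer      # continuous latents <-> discrete tokens (encode/decode assumed)
        self.transformer = transformer  # autoregressive prior over token sequences (sample assumed)
        self.decoder = decoder          # latents (+ past frames as context) -> frame

    @torch.no_grad()
    def synthesize(self, context_frames, num_future, control_tokens=None):
        # context_frames: (B, T, C, H, W) real priming frames
        frames = list(context_frames.unbind(dim=1))
        tokens = [self.quantizer.encode(self.encoder(f)) for f in frames]
        for _ in range(num_future):
            seq = torch.cat(tokens, dim=1)                  # token history, (B, N_total)
            if control_tokens is not None:                  # e.g. audio, trajectory or class tokens
                seq = torch.cat([control_tokens, seq], dim=1)
            next_tokens = self.transformer.sample(seq)      # tokens of the next frame, (B, N_frame)
            latent = self.quantizer.decode(next_tokens)     # inverse quantization
            frame = self.decoder(latent, frames)            # decode; past frames provide context / flow cues
            frames.append(frame)                            # image-space autoregression: update context
            tokens.append(next_tokens)                      # latent-space autoregression: extend history
        return torch.stack(frames, dim=1)                   # (B, T + num_future, C, H, W)

Because the transformer samples tokens stochastically in this sketch, calling synthesize several times on the same context would yield different plausible futures, in line with the multiple-predictions property mentioned in the abstract.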



Citation

@inproceedings{lemoing2021ccvs,
  title     = {{CCVS}: Context-aware Controllable Video Synthesis},
  author    = {Guillaume Le Moing and Jean Ponce and Cordelia Schmid},
  booktitle = {NeurIPS},
  year      = {2021}
}

Results

Future video prediction

16-frame video with the start frame taken from a real video

Real start frame


Synthetic video

Point-to-point video synthesis

16-frame video with the start & end frames taken from a real video

Real start frame


Synthetic video

Real end frame


State-conditioned video synthesis

16-frame video with the start frame & robotic arm trajectory taken from a real video

Real start frame


Synthetic video with the target real trajectory overlaid (white cross; best viewed by playing individual videos in fullscreen mode)

Unconditional video synthesis

16-frame video using a start token to launch the synthesis process

Synthetic video

16-frame video using StyleGAN2 to synthesize the start frame

Synthetic video

Future video prediction

16-frame video with 5 priming frames taken from a real video

Synthetic video showing sport motion

Synthetic video showing body motion

Synthetic video showing hand motion

Synthetic video showing camera motion

Future video prediction

16-frame video with the start frame taken from a real video

Synthetic video

Sound-conditioned video synthesis

90-frame video with 15 priming frames & corresponding audio track for all timesteps taken from a real video

Pair of real video (left) & synthetic video (right), with the original sound overlaid

Layout-conditioned video synthesis

30-frame video with 3 priming frames & corresponding semantic layouts for all timesteps taken from a real video

Combination of real video (left), real layout (middle) & synthetic video (right)

Class-conditioned video synthesis

64-frame video corresponding to a given action label

Action classes (one synthetic video per class): bend, jack, jump, pjump, run, side, skip, wave1, wave2, walk

Acknowledgments

This work was granted access to the HPC resources of IDRIS under the allocation 2020-AD011012227 made by GENCI. It was funded in part by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d’avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). JP was supported in part by the Louis Vuitton/ENS chair in artificial intelligence and the Inria/NYU collaboration.


Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.


Copyright © Guillaume Le Moing