Guillaume Le Moing

PhD student, Inria

I am a third-year PhD student in computer vision at WILLOW, a joint research laboratory between the Department of Computer Science of École Normale Supérieure (ENS) and Inria Paris, working under the supervision of Cordelia Schmid and Jean Ponce. I am interested in computer vision, deep learning and generative modeling. I have been working on controllable image and video synthesis, with downstream prediction tasks such as anticipating the future, and creative applications such as content editing. I received an MSc degree in Executive Engineering from École des Mines de Paris and an MSc degree in Artificial Intelligence, Systems and Data from PSL Research University.


News

09 / 2021
Our paper CCVS: Context-aware Controllable Video Synthesis was accepted at NeurIPS 2021.
03 / 2021
Our paper Semantic Palette: Guiding Scene Generation with Class Proportions was accepted at CVPR 2021.
01 / 2021
Our paper on data-efficient multiple sound source 2D localization was accepted at ICASSP 2021.
11 / 2020
I started my PhD at WILLOW.
05 / 2020
I started a 6-month internship at valeo.ai working with Tuan-Hung Vu, Himalaya Jain, Matthieu Cord and Patrick Pérez.
01 / 2019
I started a 6-month internship in the AI department of IBM Research Japan.

Research

WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction
Guillaume Le Moing, Jean Ponce, Cordelia Schmid
arXiv preprint, 2022.
@article{lemoing2022waldo,
  title     = {{WALDO}: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction},
  author    = {Guillaume Le Moing and Jean Ponce and Cordelia Schmid},
  journal   = {arXiv preprint},
  year      = {2022}
}

This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones. Individual images are decomposed into multiple layers combining object masks and a small set of control points. The layer structure is shared across all frames in each video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers, and video synthesis is broken down into discovering the layers associated with past frames, predicting the corresponding transformations for upcoming ones and warping the associated object regions accordingly, and filling in the remaining image parts. Extensive experiments on the Cityscapes (resp. KITTI) dataset show that WALDO significantly outperforms prior works with, e.g., 3, 27, and 51% (resp. 5, 20, and 11%) relative improvement in SSIM, LPIPS, and FVD metrics. Code, pretrained models, and video samples synthesized by our approach are available on the project webpage.
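
As a rough illustration of the warping step, here is a minimal sketch under simplifying assumptions (not the released WALDO code): one object layer is warped with a predicted affine transformation, and layers are then composited back into a frame. The helper names (warp_layer, composite) and toy shapes are made up for the example.

# Minimal sketch, not the official WALDO implementation: warp one object layer
# with a per-layer parametric (affine) motion, then composite layers into a frame.
import torch
import torch.nn.functional as F

def warp_layer(rgb, mask, theta):
    """Warp an RGB layer and its soft mask with a 2x3 affine transform.

    rgb:   (B, 3, H, W) appearance of one object layer
    mask:  (B, 1, H, W) soft object mask
    theta: (B, 2, 3)    per-layer parametric (affine) motion
    """
    grid = F.affine_grid(theta, rgb.shape, align_corners=False)
    warped_rgb = F.grid_sample(rgb, grid, align_corners=False)
    warped_mask = F.grid_sample(mask, grid, align_corners=False)
    return warped_rgb, warped_mask

def composite(layers):
    """Back-to-front alpha compositing of (rgb, mask) layers into one frame."""
    frame = torch.zeros_like(layers[0][0])
    for rgb, mask in layers:  # layers assumed ordered from back to front
        frame = mask * rgb + (1 - mask) * frame
    return frame

# Toy usage: a static background layer and one object layer shifted to the right.
B, H, W = 1, 64, 64
bg = (torch.rand(B, 3, H, W), torch.ones(B, 1, H, W))
obj = (torch.rand(B, 3, H, W), torch.rand(B, 1, H, W))
shift = torch.tensor([[[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]])  # small translation
obj_warped = warp_layer(*obj, shift)
next_frame = composite([bg, obj_warped])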

CCVS: Context-aware Controllable Video Synthesis
Guillaume Le Moing, Jean Ponce, Cordelia Schmid
NeurIPS, 2021.
@inproceedings{lemoing2021ccvs,
  title     = {{CCVS}: Context-aware Controllable Video Synthesis},
  author    = {Guillaume Le Moing and Jean Ponce and Cordelia Schmid},
  booktitle = {NeurIPS},
  year      = {2021}
}

This paper introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism: it conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control. The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in the appearance and temporal domains is used to further improve the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse inserted between the transformer and the decoder) adds even more flexibility by affording simple mechanisms for handling multimodal ancillary information for controlling the synthesis process (e.g., a few sample frames, an audio track, a trajectory in image space) and taking into account the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.
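
To make the "doubly autoregressive" idea concrete, here is a minimal sketch (not the CCVS release): a transformer predicts the discrete latent tokens of the next frame one by one, and each decoded frame is fed back as context. The toy encoder/decoder, codebook size, and greedy decoding are illustrative assumptions.

# Minimal sketch, not the CCVS code: autoregression in latent token space (inner
# loop) and in image space (outer loop), with a quantizer between encoder and
# transformer. All module sizes are toy assumptions.
import torch
import torch.nn as nn

num_codes, dim, tokens_per_frame = 512, 64, 16
codebook = nn.Embedding(num_codes, dim)                      # quantizer codebook
encoder = nn.Linear(3 * 32 * 32, tokens_per_frame * dim)     # toy frame -> latents
decoder = nn.Linear(tokens_per_frame * dim, 3 * 32 * 32)     # toy latents -> frame
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)
to_logits = nn.Linear(dim, num_codes)

def quantize(z):                                             # z: (B, N, dim)
    """Replace each latent vector by the index of its nearest codebook entry."""
    d = torch.cdist(z.reshape(-1, dim), codebook.weight)     # (B*N, num_codes)
    return d.argmin(dim=-1).view(z.shape[0], z.shape[1])     # (B, N) token ids

context = torch.rand(1, 3 * 32 * 32)                         # one flattened context frame
for _ in range(2):                                           # predict two future frames
    tokens = quantize(encoder(context).view(1, tokens_per_frame, dim))
    for _ in range(tokens_per_frame):                        # latent-space autoregression
        hidden = transformer(codebook(tokens))
        next_token = to_logits(hidden[:, -1]).argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    new_tokens = tokens[:, -tokens_per_frame:]               # tokens of the predicted frame
    next_frame = decoder(codebook(new_tokens).view(1, -1))   # back to image space
    context = next_frame                                     # image-space autoregression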

Semantic Palette: Guiding Scene Generation with Class Proportions
Guillaume Le Moing, Tuan-Hung Vu, Himalaya Jain, Patrick Pérez, Matthieu Cord
CVPR, 2021.
@inproceedings{lemoing2021palette,
  title     = {Semantic Palette: Guiding Scene Generation with Class Proportions},
  author    = {Guillaume Le Moing and Tuan-Hung Vu and Himalaya Jain and Patrick P{\'e}rez and Matthieu Cord},
  booktitle = {CVPR},
  year      = {2021}
}

Despite the recent progress of generative adversarial networks (GANs) at synthesizing photo-realistic images, producing complex urban scenes remains a challenging problem. Previous works break down scene generation into two consecutive phases: unconditional semantic layout synthesis and image synthesis conditioned on layouts. In this work, we propose to condition layout generation as well for higher semantic control: given a vector of class proportions, we generate layouts with matching composition. To this end, we introduce a conditional framework with novel architecture designs and learning objectives, which effectively accommodates class proportions to guide the scene generation process. The proposed architecture also allows partial layout editing with interesting applications. Thanks to the semantic control, we can produce layouts close to the real distribution, helping enhance the whole scene generation process. On different metrics and urban scene benchmarks, our models outperform existing baselines. Moreover, we demonstrate the merit of our approach for data augmentation: semantic segmenters trained on real layout-image pairs along with additional ones generated by our approach outperform models only trained on real pairs.
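
A minimal sketch of the conditioning idea (not the actual Semantic Palette architecture): a toy generator takes noise together with a vector of class proportions and outputs a soft semantic layout, and a simple loss compares the generated composition to the requested one. In practice this would be combined with adversarial objectives; the class names, sizes, and loss are assumptions for illustration.

# Hypothetical sketch: layout generation conditioned on class proportions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutGenerator(nn.Module):
    def __init__(self, num_classes=8, noise_dim=32, size=32):
        super().__init__()
        self.size, self.num_classes = size, num_classes
        self.net = nn.Sequential(
            nn.Linear(noise_dim + num_classes, 256), nn.ReLU(),
            nn.Linear(256, num_classes * size * size))

    def forward(self, noise, proportions):
        # Concatenate noise with the target class proportions ("semantic palette").
        x = torch.cat([noise, proportions], dim=1)
        logits = self.net(x).view(-1, self.num_classes, self.size, self.size)
        return F.softmax(logits, dim=1)           # soft semantic layout

def proportion_loss(layout, target_proportions):
    """Penalize the gap between generated and requested class proportions."""
    generated = layout.mean(dim=(2, 3))           # (B, num_classes) composition
    return F.l1_loss(generated, target_proportions)

G = LayoutGenerator()
z = torch.randn(4, 32)
target = torch.softmax(torch.randn(4, 8), dim=1)  # requested composition over 8 classes
layout = G(z, target)
loss = proportion_loss(layout, target)            # combined with GAN losses in practice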

Data-Efficient Framework for Real-world Multiple Sound Source 2D Localization
Guillaume Le Moing, Phongtharin Vinayavekhin, Don Joven Agravante, Tadanobu Inoue, Jayakorn Vongkulbhisal, Asim Munawar, Ryuki Tachibana
ICASSP, 2021.
@inproceedings{lemoing2021ssl,
  title     = {Data-Efficient Framework for Real-world Multiple Sound Source 2D Localization},
  author    = {Guillaume Le Moing and Phongtharin Vinayavekhin and Don Joven Agravante and Tadanobu Inoue and Jayakorn Vongkulbhisal and Asim Munawar and Ryuki Tachibana},
  booktitle = {ICASSP},
  year      = {2021}
}

Deep neural networks have recently led to promising results for the task of multiple sound source localization. Yet, they require a lot of training data to cover a variety of acoustic conditions and microphone array layouts. One can leverage acoustic simulators to inexpensively generate labeled training data. However, models trained on synthetic data tend to perform poorly with real-world recordings due to the domain mismatch. Moreover, learning for different microphone array layouts makes the task more complicated due to the infinite number of possible layouts. We propose to use adversarial learning methods to close the gap between synthetic and real domains. Our novel ensemble-discrimination method significantly improves the localization performance without requiring any label from the real data. Furthermore, we propose a novel explicit transformation layer to be embedded in the localization architecture. It enables the model to be trained with data from specific microphone array layouts while generalizing well to unseen layouts during inference.
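
One way such an explicit transformation layer could look, as a hypothetical sketch rather than the paper's implementation: the network predicts 2D source positions in the microphone array's local frame, and a differentiable rigid transform parameterized by the array pose maps them to room coordinates. The network head, pose parametrization, and shapes below are assumptions.

# Hypothetical sketch of an explicit transformation layer for 2D localization.
import torch
import torch.nn as nn

class LocalizerWithTransform(nn.Module):
    def __init__(self, feat_dim=128, max_sources=2):
        super().__init__()
        self.head = nn.Linear(feat_dim, 2 * max_sources)   # local (x, y) per source
        self.max_sources = max_sources

    def forward(self, features, array_angle, array_center):
        # Predict source positions relative to the microphone array ...
        local = self.head(features).view(-1, self.max_sources, 2)
        # ... then map them to room coordinates with an explicit, differentiable
        # rigid transform parameterized by the array pose.
        cos, sin = torch.cos(array_angle), torch.sin(array_angle)
        rot = torch.stack([torch.stack([cos, -sin], -1),
                           torch.stack([sin,  cos], -1)], -2)   # (B, 2, 2)
        return local @ rot.transpose(-1, -2) + array_center.unsqueeze(1)

model = LocalizerWithTransform()
feats = torch.randn(4, 128)                    # audio features from some backbone
angle = torch.rand(4) * 6.28                   # array orientation (radians)
center = torch.rand(4, 2)                      # array position in the room (meters)
positions = model(feats, angle, center)        # (4, 2, 2) room-frame source positions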

Teaching

Fall 2022
Teaching Assistant - Object Recognition and Computer Vision - MVA Master - ENS Paris-Saclay (50 hours)
Fall 2021
Project Advisor - Object Recognition and Computer Vision - MVA Master - ENS Paris-Saclay (volunteering)

Copyright © Guillaume Le Moing  /  Last update: November 2022