Antoine Yang

PhD student

I am a first-year PhD student in the WILLOW team of Inria Paris and École Normale Supérieure (ENS), advised by Ivan Laptev and Cordelia Schmid. My research focuses on cross-modal learning between vision and language for video understanding. I received an engineering degree from École Polytechnique and an MSc degree in Mathematics, Vision and Learning from ENS Paris-Saclay with highest honors and jury congratulations in 2020. See my LinkedIn profile for a full curriculum vitae.


09 / 2020
I am starting my PhD at Inria WILLOW
09 / 2020
I have received my MSc degree with highest honors and jury congratulations from ENS Paris-Saclay
04 / 2020
I am starting a 5-month research internship at Inria WILLOW in Paris
12 / 2019
ICLR'20 paper accepted!
09 / 2019
04 / 2019
I am starting a 5-month research internship at Huawei Noah's Ark Lab in London


See my Google Scholar profile for a full list of publications.

[New] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid
arXiv preprint arXiv:2012.00451, 2020.

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.

NAS evaluation is frustratingly hard
Antoine Yang, Pedro M. Esperança, and Fabio Maria Carlucci
International Conference on Learning Representations (ICLR), 2020.
Neural Architecture Search (NAS) is an exciting new field which promises to be as much as a game-changer as Convolutional Neural Networks were in 2012. Despite many great works leading to substantial improvements on a variety of tasks, comparison between different methods is still very much an open issue. While most algorithms are tested on the same datasets, there is no shared experimental protocol followed by all. As such, and due to the under-use of ablation studies, there is a lack of clarity regarding why certain methods are more effective than others. Our first contribution is a benchmark of 8 NAS methods on 5 datasets. To overcome the hurdle of comparing methods with different search spaces, we propose using a method’s relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols. Surprisingly, we find that many NAS techniques struggle to significantly beat the average architecture baseline. We perform further experiments with the commonly used DARTS search space in order to understand the contribution of each component in the NAS pipeline. These experiments highlight that: (i) the use of tricks in the evaluation protocol has a predominant impact on the reported performance of architectures; (ii) the cell-based search space has a very narrow accuracy range, such that the seed has a considerable impact on architecture rankings; (iii) the hand-designed macrostructure (cells) is more important than the searched micro-structure (operations); and (iv) the depth-gap is a real phenomenon, evidenced by the change in rankings between 8 and 20 cell architectures. To conclude, we suggest best practices, that we hope will prove useful for the community and help mitigate current NAS pitfalls, e.g. difficulties in reproducibility and comparison of search methods. The code used is available at


I also regularly give private lessons in mathematics for undergraduate students.
Spring 2021
  Differential equations, Teaching assistant - Undergraduate level (L2) - Sorbonne Université
Fall 2020
  Object Recognition and Computer Vision, Project advisor - Master level (MVA) - ENS Paris-Saclay
2019 - 2020
  Mathematics, Oral examiner - Undergraduate level (MPSI) - Lycée Marcelin Berthelot
Fall 2019
  Functional programming, Tutor - Undergraduate level (BSc) - École Polytechnique
2017 - 2018
  Mathematics, Oral examiner - Undergraduate level (MPSI) - Lycée Marcelin Berthelot
Spring 2017
  Multidisciplinary support, Socio-educational facilitator intern - Middle school - Collège Saint-Charles

Copyright © Antoine Yang  /  Last update: March 2021