Antoine Yang

PhD student, Inria

I am a student researcher at Google, and a PhD student in the WILLOW team of Inria and École Normale Supérieure, advised by Antoine Miech, Josef Sivic, Ivan Laptev and Cordelia Schmid. My current research is focused on learning multi-modal video representations using vision and language. I received an engineering degree from École Polytechnique and an MSc degree in Mathematics, Vision and Learning from ENS Paris-Saclay in 2020. See my LinkedIn profile for a full resume.


News

09 / 2022
FrozenBiLM is accepted at NeurIPS 2022!
06 / 2022
I am starting a 6-month research internship at Google Research in Grenoble.
03 / 2022
TubeDETR is accepted at CVPR 2022 as an oral!
07 / 2021
Just Ask is accepted at ICCV 2021 as an oral!
09 / 2020
I am starting my PhD at Inria WILLOW.
09 / 2020
I have received an MSc degree with highest honors and jury congratulations from ENS Paris-Saclay.
04 / 2020
I am starting a 5-month research internship at Inria WILLOW in Paris.
04 / 2019
I am starting a 5-month research internship at Huawei Noah's Ark Lab in London.

Research

See my Google Scholar and GitHub profiles for more information.

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid
NeurIPS 2022
@inproceedings{yang2022frozenbilm,
title = {Zero-Shot Video Question Answering via Frozen Bidirectional Language Models},
author = {Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
booktitle={Advances in Neural Information Processing Systems},
year = {2022},
}

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings.

Learning to Answer Visual Questions from Web Videos (journal extension of Just Ask)
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid
To appear in TPAMI (Special Issue on the Best Papers of ICCV 2021)
@article{yang2022learningta,
title={Learning to Answer Visual Questions from Web Videos},
author={Antoine Yang and Antoine Miech and Josef Sivic and Ivan Laptev and Cordelia Schmid},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2022},
volume={PP}
}

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results. Furthermore, our method achieves competitive results on MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our approach generalizes to another source of web video and text data. We generate the WebVidVQA3M dataset from videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.

TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid
CVPR 2022 (oral: top 4% submissions)
@inproceedings{yang2022tubedetr,
author    = {Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
title     = {TubeDETR: Spatio-Temporal Video Grounding With Transformers},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year      = {2022},
pages     = {16442--16453}}

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks.

Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid
ICCV 2021 (oral: top 3% submissions)
@inproceedings{yang2021justask,
title={Just ask: Learning to answer questions from millions of narrated videos},
author={Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={1686--1697},
year={2021}}

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
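
The contrastive training idea mentioned above can be illustrated with a minimal sketch. It assumes in-batch negatives, dot-product similarity and a softmax cross-entropy objective; these choices, the function name and the toy embeddings are illustrative assumptions and not the paper's implementation.

```python
import math


def contrastive_loss(vq_embs, ans_embs):
    """Mean softmax cross-entropy over a batch.

    Row i of vq_embs (a video-question embedding) is treated as matching
    row i of ans_embs (an answer embedding); the other answers in the batch
    act as negatives. This pairing scheme is an assumption of the sketch.
    """
    total = 0.0
    for i, vq in enumerate(vq_embs):
        # Dot-product similarity between one video-question embedding and every answer.
        scores = [sum(a * b for a, b in zip(vq, ans)) for ans in ans_embs]
        # Numerically stable log-sum-exp over all candidate answers.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability assigned to the matching answer.
        total += log_norm - scores[i]
    return total / len(vq_embs)


# Toy, hand-written embeddings (hypothetical values).
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs the loss is lower when each video-question embedding points in the same direction as its own answer embedding than when the pairs are swapped.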

NAS evaluation is frustratingly hard
Antoine Yang, Pedro M. Esperança, and Fabio Maria Carlucci
ICLR 2020
@inproceedings{yang2020nasefh,
title={NAS evaluation is frustratingly hard},
author={Antoine Yang and Pedro M. Esperança and Fabio M. Carlucci},
booktitle={International Conference on Learning Representations},
year={2020},
url={https://openreview.net/forum?id=HygrdpVKvr}}

Neural Architecture Search (NAS) is an exciting new field which promises to be as much of a game-changer as Convolutional Neural Networks were in 2012. Despite many great works leading to substantial improvements on a variety of tasks, comparison between different methods is still very much an open issue. While most algorithms are tested on the same datasets, there is no shared experimental protocol followed by all. As such, and due to the under-use of ablation studies, there is a lack of clarity regarding why certain methods are more effective than others. Our first contribution is a benchmark of 8 NAS methods on 5 datasets. To overcome the hurdle of comparing methods with different search spaces, we propose using a method’s relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols. Surprisingly, we find that many NAS techniques struggle to significantly beat the average architecture baseline. We perform further experiments with the commonly used DARTS search space in order to understand the contribution of each component in the NAS pipeline. These experiments highlight that: (i) the use of tricks in the evaluation protocol has a predominant impact on the reported performance of architectures; (ii) the cell-based search space has a very narrow accuracy range, such that the seed has a considerable impact on architecture rankings; (iii) the hand-designed macrostructure (cells) is more important than the searched micro-structure (operations); and (iv) the depth-gap is a real phenomenon, evidenced by the change in rankings between 8 and 20 cell architectures. To conclude, we suggest best practices that we hope will prove useful for the community and help mitigate current NAS pitfalls, e.g. difficulties in reproducibility and comparison of search methods.

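As a rough illustration of the relative-improvement idea described above, the sketch below computes the percentage gain of a method's accuracy over the mean accuracy of randomly sampled architectures from the same search space. The exact formula, the function name and the numbers are illustrative assumptions, not values or code from the paper.

```python
def relative_improvement(method_acc: float, random_acc: float) -> float:
    """Percentage gain of a NAS method over the random-sampling baseline.

    method_acc: accuracy of the architecture found by the search method.
    random_acc: mean accuracy of randomly sampled architectures
                from the same search space (assumed to be positive).
    """
    if random_acc <= 0:
        raise ValueError("random_acc must be positive")
    return 100.0 * (method_acc - random_acc) / random_acc


# Hypothetical accuracies, in percent.
gain = relative_improvement(method_acc=97.0, random_acc=96.5)
no_gain = relative_improvement(method_acc=96.5, random_acc=96.5)
```

Normalizing by the random baseline of the same search space is what makes scores comparable across methods whose search spaces differ.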
MANAS: Multi-Agent Neural Architecture Search
Fabio Maria Carlucci, Pedro M. Esperança, Marco Singh, Victor Gabillon, Antoine Yang, Hang Xu, Zewei Chen and Jun Wang
arXiv 2019
@article{carlucci2019manas,
  title={MANAS: multi-agent neural architecture search},
  author={Carlucci, Fabio Maria and Esperan{\c{c}}a, Pedro M and Singh, Marco and Gabillon, Victor and Yang, Antoine and Xu, Hang and Chen, Zewei and Wang, Jun},
  journal={arXiv preprint arXiv:1909.01051},
  year={2019}
}

The Neural Architecture Search (NAS) problem is typically formulated as a graph search problem where the goal is to learn the optimal operations over edges in order to maximise a graph-level global objective. Due to the large architecture parameter space, efficiency is a key bottleneck preventing the practical use of NAS. In this paper, we address the issue by framing NAS as a multi-agent problem where agents control a subset of the network and coordinate to reach optimal architectures. We provide two distinct lightweight implementations, with reduced memory requirements (1/8th of state-of-the-art), and performance above that of much more computationally expensive methods. Theoretically, we demonstrate vanishing regrets of the form O(sqrt(T)), with T being the total number of rounds. Finally, aware that random search is an often ignored yet effective baseline, we perform additional experiments on 3 alternative datasets and 2 network configurations, and achieve favourable results in comparison.

Talks


Teaching

Fall 2022
  Object Recognition and Computer Vision, Teaching assistant - Master level (MVA) - 50 hours - ENS Paris-Saclay
Fall 2021
  Object Recognition and Computer Vision, Project advisor - Master level (MVA) - Volunteering - ENS Paris-Saclay
2021 - 2022
  Mathematics, Oral examiner - Undergraduate level (MPSI, MP and MP*) - 80 hours - Lycée Marcelin Berthelot
Spring 2021
  Differential equations, Teaching assistant - Undergraduate level (L2) - 38 hours - Sorbonne Université
Fall 2020
  Object Recognition and Computer Vision, Project advisor - Master level (MVA) - Volunteering - ENS Paris-Saclay
2019 - 2020
  Mathematics, Oral examiner - Undergraduate level (MPSI) - 60 hours - Lycée Marcelin Berthelot
Fall 2019
  Functional programming, Tutor - Undergraduate level (BSc) - 24 hours - École Polytechnique
2017 - 2018
  Mathematics, Oral examiner - Undergraduate level (MPSI) - 60 hours - Lycée Marcelin Berthelot
Spring 2017
  Multidisciplinary support, Socio-educational facilitator intern - Middle school - Collège Saint-Charles

Misc.

Reviewer for CVPR 2022, ECCV 2022 and CVPR 2023.

Copyright © Antoine Yang  /  Last update: November 2022