Wednesday, November 17, 2021 - 1:00pm to 2:00pm
CORD 1109 or Zoom https://oregonstate.zoom.us/j/93591935144?pwd=YjZaSjBYS0NmNUtjQzBEdzhPeDZ5UT09

Speaker Information

Stefan Lee
Assistant Professor
Computer Science
Oregon State University

Abstract

Vision-and-language research consists of a diverse set of tasks. While underlying image datasets, input/output APIs, and model architectures vary across tasks, there exists a common need to associate imagery and text -- i.e. to perform visual grounding. Despite this, the standard paradigm has been to treat each task in isolation -- starting from separately pretrained vision and language models and then learning to associate their outputs as part of task training. This siloed approach fails to leverage grounding supervision between tasks and can result in myopic groundings when datasets are small or biased. In this talk, I'll discuss a line of work focused on learning task-agnostic visiolinguistic representations that can serve as a common foundation for many vision-and-language tasks. First, I'll cover recent work on learning a generic multimodal encoder (ViLBERT) from large-scale web data and transferring this pretrained model to a range of vision-and-language tasks. Second, I'll show how multitask training from this base architecture further improves task performance while unifying twelve vision-and-language tasks in a single model.

Speaker Bio

Stefan Lee is an assistant professor in the School of Electrical Engineering and Computer Science at Oregon State University and a member of the Collaborative Robotics and Intelligent Systems (CoRIS) Institute there. His work addresses problems at the intersection of computer vision and natural language processing. He is the recipient of the DARPA Rising Research Plenary Speaker Selection (DARPA 2019), two best paper awards (EMNLP 2017, CVPR 2014 Workshop on Egocentric Vision), and multiple awards for review quality (CVPR 2017, 2019, 2020; ICCV 2017; NeurIPS 2017-2018; ICLR 2018-2019; ECCV 2020).