Posted on 2020-08-01 00:00, authored by Mehrdad Alizadeh
Visual Question Answering (VQA) concerns providing answers to natural language questions about images. Several deep neural network approaches have been proposed to model the task in an end-to-end fashion. Although the task is grounded in visual processing, the language understanding component becomes crucial when the question is a complex, free-form sentence. In this work, I hypothesize that if a question focuses on events described by verbs, the model should be aware of verb semantics, as expressed via semantic role labels, argument types, and/or frame elements. Unfortunately, no existing VQA dataset includes verb semantic information. My first contribution is a new VQA dataset (imSituVQA), built by taking advantage of the imSitu annotations. The imSitu dataset consists of images manually labeled with semantic frame elements, mostly drawn from FrameNet. My second contribution is a multi-task CNN-LSTM VQA model that learns to classify the answers as well as the semantic frame elements. Experiments on imSituVQA show that semantic frame element classification helps the VQA system avoid inconsistent responses and improves performance.
Semantic role labeling offers an alternative way to approximately annotate any VQA dataset of interest. I employed a PropBank-based semantic role labeler to annotate a subset of the VQA dataset (VQAsub), and then trained the proposed multi-task CNN-LSTM model on VQAsub. The results show a slight improvement over the single-task CNN-LSTM model.
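The abstract does not give architectural details, so the following is only a minimal PyTorch sketch of the multi-task idea it describes: an LSTM encodes the question, pre-extracted CNN image features are fused with it, and two classification heads share that representation, one predicting the answer and one predicting semantic frame elements. All layer sizes, the element-wise fusion, and the class names are assumptions for illustration, not the author's actual model.

```python
# Hypothetical sketch of a multi-task CNN-LSTM VQA model.
# All dimensions and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskVQA(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128,
                 img_dim=256, n_answers=500, n_frame_elements=190):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        # Two task-specific heads share the fused image-question vector.
        self.answer_head = nn.Linear(hidden_dim, n_answers)
        self.frame_head = nn.Linear(hidden_dim, n_frame_elements)

    def forward(self, img_feats, question_ids):
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                      # final LSTM state encodes the question
        v = torch.tanh(self.img_proj(img_feats))
        fused = q * v                  # element-wise fusion of the two modalities
        return self.answer_head(fused), self.frame_head(fused)

model = MultiTaskVQA()
img = torch.randn(2, 256)              # stand-in for pre-extracted CNN features
qs = torch.randint(0, 1000, (2, 12))   # token ids for two 12-word questions
ans_logits, fe_logits = model(img, qs)
print(ans_logits.shape, fe_logits.shape)
```

During training, the two heads would typically be optimized jointly, e.g. by summing a cross-entropy loss per head, which is how multi-task supervision can regularize the shared representation.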
History
Advisor
Di Eugenio, Barbara
Chair
Di Eugenio, Barbara
Department
Computer Science
Degree Grantor
University of Illinois at Chicago
Degree Level
Doctoral
Degree Name
PhD, Doctor of Philosophy
Committee Member
Parde, Natalie
Caragea, Cornelia
Ziebart, Brian
Enis Cetin, Ahmet