Enhancing Visual Question Answering with Linguistic Information
thesisposted on 01.08.2020, 00:00 by Mehrdad Alizadeh
Visual Question Answering (VQA) concerns providing answers to natural language questions about images. Several deep neural network approaches have been proposed to model the task in an end-to-end fashion. Whereas the task is grounded in visual processing, given a complex free form question the language understanding component becomes crucial. In this work, I hypothesize that if the question focuses on events described by verbs, then the model should be aware of verb semantics, as expressed via semantic role labels, argument types, and/or frame elements. Unfortunately, no VQA dataset exists that includes verb semantic information. My first contribution is a new VQA dataset (imSituVQA) that I built by taking advantage of the imSitu annotations. The imSitu dataset consists of images manually labeled with semantic frame elements, mostly taken from FrameNet. Second, I propose a multi-task CNN-LSTM VQA model that learns to classify the answers as well as the semantic frame elements. The experiments on imSituVQA show that semantic frame element classification helps the VQA system avoid inconsistent responses and improves performance. Semantic role labeling is an alternative solution to approximately annotate any VQA dataset of interest. I employed a PropBank based semantic role labeler to label a subset of the VQA dataset (VQAsub). Then I trained the proposed multi-task CNN-LSTM model with VQAsub. The results show a slight improvement over the single-task CNN-LSTM model.