BELLINI-THESIS-2020.pdf (22.62 MB)

Towards Open-Ended VQA Models using Transformers

thesis
posted on 01.05.2020, 00:00 by Alberto Mario Bellini
In this work, we introduce a new architecture to address the Visual Question Answering (VQA) problem, an open field of research in the NLP and Vision communities. In the last few years, with the advent of Deep Learning and the exponential growth of computing power, researchers have come up with brilliant solutions to tackle the problem. However, most related work shares a common limitation: the number of possible answers is usually restricted to a limited set of candidates, which reduces the power of such models. We describe a new architecture that employs state-of-the-art language models, such as the Transformer, to generate open-ended answers. Our contribution to the scientific community lies in a new approach that allows VQA systems to generate unconstrained answers. We first introduce the necessary background and the most critical computational models for dealing with text and images. Finally, we show that our architecture compares well with other VQA models, setting a new baseline for future work.

Advisor

Parde, Natalie

Chair

Parde, Natalie

Department

CS

Degree Grantor

University of Illinois at Chicago

Degree Level

Masters

Degree name

MS, Master of Science

Committee Member

Di Eugenio, Barbara; Tang, Wei; Lanzi, Pier Luca

Submitted date

May 2020

Thesis type

application/pdf

Language

en
